Early Machine Learning Innovations

I’m certainly not a scholar on the topic, but I am interested in the history of Machine Learning, and this morning I discovered a concept known as Fisher Information. This is the same Sir Ronald Fisher who introduced the Iris Dataset in 1936, which is most certainly a Machine Learning dataset, though it predates ENIAC, the first true general-purpose electronic computer, built in 1945. The point is that the Iris Dataset was way ahead of its time, using measurable characteristics of various flowers to determine the species of each flower. This is a deep idea, in that you have the mathematical classification of species, which I would argue goes beyond the anatomical, and brings biology into the mathematical sciences.

But on top of this, and what seem to be many other achievements I don’t know much about, he had a really clever idea regarding how much information a random variable carries about a parameter. Specifically, how much does a given probability distribution f(X; \theta) change as a function of \theta? His answer was to look at the derivative of f (more precisely, of \log f) with respect to \theta, though the specific formula is a bit more involved. Nonetheless, the basic idea is: how sensitive is a distribution to one of its parameters, and what does that tell me?
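For reference, and this is just the standard modern statement rather than anything taken from Fisher’s own papers, the Fisher Information is the expected squared derivative of \log f with respect to the parameter:

I(\theta) = E_\theta \left[ \left( \frac{\partial}{\partial \theta} \log f(X; \theta) \right)^2 \right].

which makes the “sensitivity to \theta” reading explicit.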

This is exactly what Machine Learning engineers do all the time, which is to test the relevance of a dimension. Imagine you have a dataset with dimensions 1 through N, and a prediction function on that dataset F(x_1, \ldots, x_N). Now imagine you add a set of weights (\theta_1, \ldots, \theta_N), for \theta_i \in [0,1], so that you instead consider the function F(\theta_1 x_1, \ldots, \theta_N x_N). That is, we’ve added weights that reduce the contribution of each dimension simply by multiplying by a constant in [0,1]. This is one of the most basic things you’ll learn in Machine Learning, and the rate of change in accuracy as a function of each \theta_i tells you how important each dimension is to the prediction function. This is basically what Fisher did, except almost one hundred years ago, effectively discovering a fundamental tool of Machine Learning.
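As a minimal sketch of the weighting idea, and assuming scikit-learn is available, the snippet below trains a classifier on the Iris Dataset (fittingly) and then damps one dimension at a time to see how validation accuracy responds; the model choice and weight values are purely illustrative.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sketch: shrink each feature by a weight theta in [0, 1] and watch validation
# accuracy. A large drop as theta decreases suggests that dimension matters.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
base_acc = model.score(X_val, y_val)

for i in range(X.shape[1]):
    for theta in (1.0, 0.5, 0.0):
        X_damped = X_val.copy()
        X_damped[:, i] *= theta  # multiply one dimension by a constant in [0, 1]
        acc = model.score(X_damped, y_val)
        print(f"feature {i}, theta={theta:.1f}: accuracy {acc:.3f} (baseline {base_acc:.3f})")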

The point is more than just historical. I think Machine Learning is a buzzword used to cover up the fact that a lot of this stuff was known a long time ago, that Artificial Intelligence is, generally speaking, far more advanced than the public realizes, and that, as a matter of logical implication, most of what we believe to be new and exciting breakthroughs are often mundane adaptations of existing methods and technology. The fact that so much money is being poured into the market is disturbing, because I have no idea what these people do all day.

The Inheritance of Molecular Machinery

It turns out mtDNA is inherited directly from the mother by her offspring, and I just had the idea that perhaps the molecular machines responsible for replicating DNA generally, and for performing other functions within cells, are also inherited directly from the mother, for the simple reason that females carry the young until birth in basically all mammals. I don’t know how molecular machines are encoded in our DNA, and I haven’t looked it up, if I’m being perfectly honest, but the idea is not ridiculous, again for the simple reason that women carry the offspring. I will look this up and revisit the idea, but I think it’s interesting, because again (if true), like mtDNA, it suggests a fundamental asymmetry between the sexes in terms of the contribution of DNA to offspring.

In particular, molecular machines must require a lot of information to encode, as they are effectively nano-scale robots that perform simply astonishing functions, including the production of ATP and the transport of molecules. In contrast, the macroscopic frame of a human being is not that complicated, as it has a simple shape; organs are probably next up in terms of complexity, with bones perhaps a tie, since they have to move; organelles I don’t think are much more complicated than organs themselves. But molecular machines operate at the scale of a microprocessor, transporting single-digit numbers of electrons, at a scale where interference from the natural background we live in must be occurring basically constantly, which will cause acceleration and disrupt cellular processes, in particular with respect to the transportation of electrons.

If sperm becomes too complex, then it has to become either more efficient in encoding information, or physically larger; there’s no way around that, since it by definition carries the information of the paternal line. As a result, a mechanism that allows females to transmit the most complex aspects of their offspring makes sense, because that information doesn’t have to go anywhere. I note that ATP synthase is found in the mitochondrion, which is again inherited directly from the mother.

Higher Order Relations

I was reading my favorite book on mathematics, Mathematical Problems and Proofs, in particular a section on basic Set Theory. The book discusses the transitive relation, where if A is related to B, and B is related to C, then A is related to C. In this case, A, B, and C are abstract mathematical objects, but you can assign practical meaning by, e.g., making them all integers and considering ordinal relationships between them, where, e.g., A is greater than B, B is greater than C, and therefore A is greater than C. Note that this example of ordinal relationships has a “therefore” clause, but relations are abstract statements of fact, not consequences of logic. That is, we simply posit relations between objects, whereas I’ve phrased the concrete example in terms of a logical conclusion, which is very different. The example is consistent with the stated set of relations among A, B, and C, which are simply posited to exist, whereas the integers have properties that imply that A is greater than C as a matter of logic.

With that introduction, it dawned on me that we can consider higher order sets of relations that probably don’t have names like “transitive”. One obvious such set of relations is as follows: A is related to B, B is related to C, C is related to D, and A is related to D. All I did was add an extra object D and extend the relations analogously. Specifically, we can express this as a graph, where A through D are connected by a path, and A is connected directly to D by an extra edge, creating what would be a circuit in an undirected graph. Note, though, that even if A is related to B, this does not imply that B is related to A, and as such, any graph expressing relations is directed. This is probably known, given how simple it is, and I’m certain through my own studies that people express relations using graphs.
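As a small sketch of that representation, and assuming the relation is simply given as a set of ordered pairs (my own illustration, not anything from the book), the following checks for the four-object pattern described above:

from itertools import permutations

# A relation as a set of ordered pairs (directed edges): x is related to y iff (x, y) is present.
relation = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}

def has_four_object_pattern(rel, objects):
    # Look for a path a -> b -> c -> d together with the extra edge a -> d.
    for a, b, c, d in permutations(objects, 4):
        if {(a, b), (b, c), (c, d), (a, d)} <= rel:
            return (a, b, c, d)
    return None

print(has_four_object_pattern(relation, {"A", "B", "C", "D"}))  # ('A', 'B', 'C', 'D')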

The interesting bit is the possibility of using machines to discover meaningful higher order relations that require, say, four or more objects. Because it’s at least possible for these relations to arise over any number of objects, we can’t give them all special names in a human language like “transitive”, but a machine can. The point being that most of mathematics is probably accessible only to machines or other sentient beings capable of handling that much information, which plainly do not inhabit this planet in any appreciable number.

Later Classical Peoples

I’ve written a lot about human history based upon mtDNA, supplemented with some archeological and anecdotal evidence. I put this together into a fairly comprehensive yet approachable presentation here. The Ancient Egyptian portion of the presentation focuses on what is commonly referred to as the “Old Kingdom”, when Egyptians plainly looked Asian, and as it turns out, the Old Kingdom mtDNA genome in the dataset is a 99% match to many modern South East Asians. The other, more recent genome is instead dated to 129 to 138 AD, during the Roman Period of Ancient Egypt. They are definitely different genomes, and what’s interesting is that this later genome basically fills the gap in the European population, with a large number of people in Europe, Northern Europe in particular, being either a 99% match to the Old Kingdom genome or a roughly 87% match to the Roman Period genome. This is not terribly controversial, since it merely claims that many Europeans are of either Ancient Egyptian or Roman Egyptian descent, both being major Mediterranean powers.

However, the truth is much more complicated than that, and the data suggests both groups, the Old Kingdom Egyptians and the Roman Egyptians, came from Asia. That is, Europeans are probably not descended from the Ancient Egyptians, but instead, Northern Europeans descend from the same group of people in Asia, some of whom went North, becoming Scandinavians and Germans, and some of whom went South, becoming Ancient Egyptians and Tanzanians. In particular, the Old Kingdom Ancient Egyptians seem to be related to South East Asians, both physically and in terms of their mtDNA, whereas the later Egyptians seem to be related to Middle Eastern and Indian people, again both physically and in terms of their mtDNA. Below on the left are King Menkaure and his Queen (c. 2,500 BC), both seemingly of Asian origin, and on the right is Cleopatra (c. 50 BC), plainly European, and known to be Macedonian.

The thesis here is that both peoples came from Asia, and that the Old Kingdom Ancient Egyptians are more closely related to modern South East Asians, whereas the Roman Period Egyptians were possibly of Indian origin, in particular the Munda people, which is consistent with their appearance, in particular their noses. Further, the Roman Period Egyptians are an 87% match to the Swedes, Icelanders, and Dubliners, plainly suggesting the Vikings were also an Indo-Aryan people. None of this is that novel; it’s just interesting that mtDNA is consistent with the notion of Indo-Aryan languages, plainly suggesting that a lot of Europeans really are of Indian origin. The novel piece is that the Ancient Egyptians are another group of Asians that came earlier, and look more South East Asian than Indian. Both bloodlines are present in modern day Europeans, in particular Northern Europeans.

mtDNA Presentation

I’ve put together a presentation summarizing some of my work on mtDNA for a Meetup event, which is available here. Because this is a summary, it’s really easy to read, and focuses on human history, as opposed to Machine Learning or genetics. In particular, I present a very credible theory of the entire history of humanity, from Africa, to Asia, and back, using mostly Machine Learning applied to mtDNA, though supplemented with archeological evidence as well.

Enjoy!

Charles

Corrections to the Proof of Cantor’s Theorem

I was reading my absolute favorite book on mathematics, Mathematical Problems and Proofs, and it mentions Cantor’s Theorem, that the cardinality of a set is always less than the cardinality of its power set. So for example, |\{1,2\}| = 2, which is less than |\{ \{\emptyset \}, \{1\}, \{2\}, \{1,2\}\}| = 4. There is, however, no proof in that book of the general case, only a proof of the finite case, expressed as a counting argument. I looked up proofs, and all the proofs I could find seem to have the same hole, which I’ll discuss.

Let A be a set, and let P(A) denote the power set of A, i.e., the set of all of its subsets. Now assume that |A| \geq |P(A)|. That is, we are assuming arguendo that the cardinality of A is not less than the cardinality of its power set, in contradiction to Cantor’s Theorem. It follows that there must be some surjective function \psi: A \rightarrow P(A), meaning that every element of P(A) is mapped to by \psi; such a function must exist because we have assumed that |A| \geq |P(A)|. That is, because |A| \geq |P(A)|, there are enough elements in A to map onto every element of P(A).

Now the next step of the proof (at least in the few versions I saw this morning) is to define the set B below, without addressing the possibility that the set is empty, though it’s possible the original proof addresses this case. My issue with this is that there doesn’t seem to be any difference between an empty set and a set that is provably non-existent. As such, I don’t think empty sets should be used as the basis of a proof, unless the proof follows only from the non-existence of the set, together with other true theorems, true axioms, or additional assumptions (in the case of a line of reasoning being considered arguendo). In this case, proving that the set in question is non-empty changes the scope of the proof, in that it only applies to sets with a cardinality of two or greater. Most importantly, and consistent with this, the accepted proof fails in the case of the power set of the empty set, and in the case of the power set of a singleton, because of this hole. This is very serious: the proof is literally wrong, and fails in two cases that are plainly intended to be covered by the proof.

The Modified Proof

Specifically, proofs of Cantor’s Theorem define B = \{x \in A | x \notin \psi(x)\}. That is, because we know \psi (defined above) must exist, we know that we can define B using \psi. However, it’s possible that B is empty, since there might not be any such x \in A. That said, a simple additional step will prove that we can always define \hat{\psi}, which will produce a non-empty set \hat{B}.
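As a small finite illustration of how B can come out empty (my own example, not from any published proof), take A = \{1,2\} and a map \psi under which every element lands in its own image:

# Diagonal set B = {x in A : x not in psi(x)}, for one particular map psi from A into P(A).
A = {1, 2}
psi = {1: frozenset({1}), 2: frozenset({1, 2})}  # every x is a member of psi(x)

B = {x for x in A if x not in psi[x]}
print(B)  # set(), i.e., B is empty for this choice of psi

Of course, this particular \psi is not surjective (no map from a two-element set onto its four-element power set is), but the point is simply that the defining condition for B can fail to select any element at all.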

Assume that B is empty and that |A| \geq 2. Because A \subset P(A), there must be x,y \in P(A) such that x \neq y, and x,y \in A. Because \psi is surjective, there must be a,b such that \psi(a) = x and \psi(b) = y. If a \notin x or b \notin y, then B is non-empty. As such, assume that a \in x and b \in y. Now define \hat{\psi} such that \hat{\psi}(x) = y and \hat{\psi}(y) = x, but otherwise \hat{\psi}(z) = \psi(z) for z \notin \{x,y\}. If x and y are both singletons, then x \notin y and y \notin x. If either is not a singleton (or both are not singletons), then it must be the case that either x \notin y or y \notin x, or both. This implies that \hat{B} is not empty. Because this can be done for any surjective function \psi, we are always guaranteed a non-empty set \hat{B}. Note that \hat{\psi} is still surjective.

Now we can complete the proof as it is usually stated. Because \hat{\psi} is surjective, there must be some x_0 such that \hat{\psi}(x_0) = \hat{B} \in P(A). It must be the case that either x_0 \in \hat{B} or x_0 \notin \hat{B}. If x_0 \in \hat{B}, then x_0 fails the criteria for inclusion in \hat{B}, namely \hat{B} = \{x \in A | x \notin \hat{\psi}(x)\}, since by definition, \hat{\psi}(x_0) = \hat{B}. If x_0 \notin \hat{B}, then x_0 satisfies the criteria for inclusion in \hat{B}. In both cases, we have a contradiction. The assumption that |A| \geq |P(A)| implies the existence of \psi, which in turn implies the existence of \hat{\psi} and \hat{B}, which leads to a contradiction. In order to resolve this contradiction, we must therefore conclude instead that |A| < |P(A)|, which completes the proof.

The Case of the Empty Set

Now assume that A = \emptyset, and let’s apply the proof above. Generally speaking, we would assume that the power set of the empty set contains one element, namely a set that contains the empty set as a singleton, represented as P(A) = \{\{\emptyset\}\}. Assuming we can even define \psi in this case, it must be that \psi(\emptyset) = \{\emptyset\}. Because \emptyset \in \{\emptyset\}, it must be the case that B = \emptyset. However, if we want A \neq P(A), it must be the case that \emptyset \neq \{\emptyset\}, even though \emptyset \in \{\emptyset\}. Therefore, \emptyset \notin P(A), and instead, \emptyset \in \{\emptyset\}, and as a result, the proof fails in the case of |A| = 0 since B = \emptyset \notin P(A). That is, the accepted proof assumes B is contained in the power set, which is just not true in this case, suggesting that there really are problems deriving theorems from empty sets.

The Case of a Singleton

Now assume that A = \{x\}, and let’s apply the proof above. The power set of a singleton is generally defined as P(A) = \{ \{\emptyset\}, \{x\}\}. Because the accepted proof does not explicitly define \psi, we are free to define \psi(x) = \{x\}. Because x \in \{x\}, B = \emptyset. As noted above, \emptyset \neq \{\emptyset\}, and therefore, B \notin P(A). Again, the accepted proof fails.

Because the accepted proof fails in two cases where B is empty, my axiom above, requiring sets to be non-empty in order to derive theorems, must be true. Nothing else within the proof is subject to meaningful criticism. Again, I have no idea whether Cantor’s original proof addressed these points, but it’s surprising that no one has bothered to apply these proofs to the case of the empty set and a singleton, which would make it clear that the argument doesn’t work in those cases.

A thought on Existence versus Substance

I’ve spent a significant amount of time thinking about how it is that the mind can consider objects that don’t seem to be obviously physically real. For example, Cantor’s cardinals don’t have any convenient corporeal form, yet there are theorems that rigorously govern them, which we can distinguish from other statements about that very same topic that are false. This, in my opinion, leads to the question of how it is that the mind, which science says should be entirely physical, can consider objects that don’t appear to be a part of 3-space, and derive apparently true statements about those objects. I think I have a sensible answer, which is that all true statements govern reality, and in that sense, they exist, even if they don’t have a location. For example, the rules of combinatorics govern at all points in space and time, and therefore exist in that sense. However, the rules of combinatorics are not embedded in some cosmic ledger; instead, they exist through the effects that they have on objects.

We’re still in a bit of trouble with Cantor, because nothing within the human purview is infinite. So then we’re resigned to considering ideas that govern other ideas, under the assumption those latter ideas exist, and I’m fine with that, and perhaps that’s the right answer. But all of this points to the possibility that the human mind, distinct from the brain, is similarly without a location, and instead governs the behavior of a sentient person, including their brain. Perhaps this is a leap, but we need something to explain the fact that the mind can consider objects outside 3-space that have no finite expression, that we instead understand through reference. That is, when I write \aleph_0, I’m not encoding or representing anything; I am instead referencing something that is in your memory as a reader, thereby conjuring a concept with which you’re already familiar, that cannot be expressed physically in finite time.

On the origins of modern human mtDNA

In my paper, A New Model of Computational Genomics [1], I introduce a simple test for ancestry that cannot credibly be argued with. The argument is as follows: assume that three groups of individuals with genome A all begin in location a. Now assume that two of those groups go to different locations, specifically, that one group goes to location b and the other group goes to location c. Because mtDNA is so stable, it could be the case that even over significant amounts of time, the populations in locations b and c still have genome A, with basically no mutations. If, however, any mutations occur, it cannot credibly be the case that the genomes in location b (genome B) and location c (genome C) develop even more bases in common with each other. This becomes increasingly unlikely as a function of the number of new matching bases between B and C, and is governed by the binomial distribution. As a consequence, if A is the common ancestor of genomes B and C, it must be the case that |AB| > |BC| and |AC| > |BC|, where |xy| denotes the number of matching bases between genomes x and y. That is, A must have more bases in common with B and C than B and C have in common with each other, since B and C independently mutated away from genome A.
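As a minimal sketch of the test, assuming the genomes are given as aligned sequences of equal length (this is my own illustration, not the attached code mentioned below):

def matches(x: str, y: str) -> int:
    # Number of matching bases between two aligned sequences of equal length.
    return sum(a == b for a, b in zip(x, y))

def consistent_with_ancestor(A: str, B: str, C: str) -> bool:
    # The putative ancestor A should share more bases with each of B and C
    # than B and C share with each other.
    return matches(A, B) > matches(B, C) and matches(A, C) > matches(B, C)

# Toy example: B and C each mutate independently away from A.
A = "ACGTACGTAC"
B = "ACGTACGTAA"  # one mutation
C = "ACGAACGTAC"  # a different mutation
print(consistent_with_ancestor(A, B, C))  # True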

Applying this test, we find that the Old Kingdom Ancient Egyptians are the common ancestors of basically all Northern Europeans, many Africans, Asians, and in particular, South East Asians. I’ve also noted repeatedly that the Old Kingdom Ancient Egyptians appear to be Asian, which, superficially, makes no sense. Finally, I’ve noted that Heidelbergensis plainly evolved into Phoenicians, and then the Old Kingdom Ancient Egyptians. Phoenicians appear in Asia on the maternal line, in particular in Sri Lanka.

Putting it all together, tonight I tested which population is most likely to be the ancestor of the Old Kingdom Ancient Egyptians, and the clear answer is the Sri Lankans. The attached code runs the test and produces a normalized score. The Sri Lankans scored 17.36, and the next best answer was the Vedda Aboriginals (also in Sri Lanka), with a score of 8.3064. The plain implication is that the mutation from the Phoenician maternal line into the Old Kingdom Ancient Egyptian maternal line took place in Sri Lanka, or somewhere very close.
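Purely as a hypothetical sketch of how such a ranking could be organized, reusing consistent_with_ancestor from the sketch above, one could score each candidate by the fraction of descendant pairs for which the ancestor test passes; to be clear, this is illustrative only and is not how the attached code computes its normalized score:

from itertools import combinations

def ancestor_score(candidate: str, descendants: list[str]) -> float:
    # Hypothetical score: the fraction of descendant pairs (B, C) for which
    # the candidate passes the ancestor test above. Illustrative only.
    pairs = list(combinations(descendants, 2))
    hits = sum(consistent_with_ancestor(candidate, B, C) for B, C in pairs)
    return hits / len(pairs) if pairs else 0.0

# Candidate populations would then be ranked by this score, highest first.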

This completes the history of mankind: the people of Cameroon are the likely source population of all of mankind (including the Denisovans, Heidelbergensis, and Neanderthals); Heidelbergensis then evolved into the Phoenicians; the Phoenicians traveled to Asia, there evolving into the Old Kingdom Ancient Egyptian maternal line, who then migrated back to North East Africa, seeding modern human mtDNA all over the world, which suggests they were even more successful as a people than current history implies.

Information and the Continuum Hypothesis

Years ago, when I was living in Sweden, I discovered what seems to be an original number, which I called \Omega, defined as the logarithm of \aleph_0, i.e., \Omega = \log(\aleph_0). You can read more about it in this paper [1], where I present a few theorems and introduce a machine that actually has storage equal to \Omega bits. The machine would require infinite time to construct, but so does a true UTM, which has an infinite tape. Similarly, the machine described in [1] has countably many states, and therefore, \Omega is the amount of information that can be stored on any machine that has countably many states. As a consequence, \Omega has intuitive physical meaning, and therefore, we have a reason to assume it exists.

I was thinking about the number again while food shopping, and I realized it fits the continuum hypothesis perfectly. I show in [1] that there is no set with a cardinality of \Omega, and that, as a logical consequence, \Omega must be something other than a number. The natural answer is that \Omega is a quantity of information. We can think about zero the same way, as 0 = \log(1), which satisfies our intuition that zero represents no substance. Further, in this case, zero is the amount of information contained in a system that has exactly one state. This is plainly true, since a system that never changes cannot meaningfully store information, and instead simply exists. Similarly, \Omega is not the cardinality of any set, and therefore not a number, but it is nonetheless a meaningful quantity of information. As a consequence, it’s arguable that we cannot perform an operation a number of times equal to \Omega (see below for more on this topic).

All of that said, \Omega^{\aleph_0} is defined, because that is simply the product \Omega^{\aleph_0} = \Omega \cdot \Omega \cdot \Omega \ldots, taken a countable number of times. Basic assumptions of arithmetic imply the following inequality:

2^{\aleph_0} \leq \Omega^{\aleph_0} \leq \aleph_0^{\aleph_0} = \aleph_1.

Because 2^{\aleph_0} = \aleph_1 (i.e., assuming the continuum hypothesis, discussed below), it must be the case that \Omega^{\aleph_0} = \aleph_1.

However, taking the logarithm, we have the following:

\aleph_0 \leq \aleph_0 \log(\Omega) \leq \aleph_0 \Omega.

Corollary 1.9 of [1] proves that \Omega is unique, in that there is no other number of bits X \neq \Omega such that X > n for all n \in \mathbb{N}, and X < \aleph_0. Therefore, \log(\Omega) does not exist. As a result, we learn nothing from taking the logarithm, since \Omega is the smallest infinite quantity. However, we can conceive of a countable collection of machines with \Omega bits of storage, and you can therefore easily show that \aleph_0 \cdot \Omega = \aleph_0.

Most interestingly, it must be the case that,

\aleph_0 \leq \aleph_0^{\Omega} \leq \aleph_1.

Therefore, because of the uniqueness of \Omega, if the continuum hypothesis is true, the above inequality is the unique solution. Note that the continuum hypothesis has been proven independent of the standard axioms. Further, note that this would arguably require assuming we can perform an operation \Omega times, which is not an intuitive assumption. You could argue that we already permit the products and powers of real numbers, which are plainly not cardinals, but we can ascribe physical meaning to such processes through approximation using finite decimal rational numbers, which are ratios of cardinal numbers. In contrast, \Omega has no such approximation.

A few other thoughts: n\Omega = \Omega for all n \in \mathbb{N}, and you can appreciate this by simply reading [1], even though it’s not proven there. However, I realized tonight that it’s arguable \Omega^2 is not defined. In contrast, 2\Omega is just the number of bits contained in two machines, and more generally, n\Omega is simply the number of bits contained in n machines. It’s not clear how to think about a collection of machines \Omega in number, precisely because there is no set that contains \Omega elements. Therefore, integer powers of \Omega are arguably undefined. However, as noted above, we can conceive of a countable collection of machines, and you can easily show that \aleph_0 \cdot \Omega = \aleph_0. I’m nearly certain I unpacked all of this in my notes in Sweden, but I didn’t include it in [1], because I was working on completing Black Tree AutoML. That said, I discovered this stuff around 2018, though I doubt anyone has done any work on it in the interim anyway.