Early Machine Learning Innovations

I’m certainly not a scholar on the topic, but I am interested in the history of Machine Learning, and this morning I discovered a concept known as the Fisher information. This is the same Sir Ronald Fisher who introduced the Iris Dataset in his 1936 paper (the measurements themselves were collected by the botanist Edgar Anderson), which is most certainly a Machine Learning dataset, though it predates ENIAC, often cited as the first general-purpose electronic computer, which was completed in 1945. The point is that the Iris Dataset was way ahead of its time, using measurable characteristics of various flowers to determine the species of each flower. This is a deep idea, in that you have the mathematical classification of species, which I would argue goes beyond the anatomical, and brings biology into the mathematical sciences.

But on top of this, and what seem to be many other achievements I don’t know much about, he had a really clever idea about how much information an observation carries about a parameter. Specifically, how much does a given probability distribution f(X, \theta) change as a function of \theta? His answer was to look at the derivative of \log f with respect to \theta, though the specific formula is a bit more involved. Nonetheless, the basic idea is: how sensitive is a distribution to one of its parameters, and what does that tell me?
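For reference, the usual modern statement of this quantity, the Fisher information, is the expected squared derivative of the log-density with respect to the parameter:

I(\theta) = E\left[ \left( \frac{\partial}{\partial \theta} \log f(X; \theta) \right)^2 \right].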

This is exactly what Machine Learning engineers do all the time, which is to test the relevance of a dimension. Just imagine you had a dataset with dimensions 1 through N, and that you have a prediction function on that dataset F(x_1, \ldots, x_N). Now imagine you add a set of weights (\theta_1, \ldots, \theta_N), for \theta_i \in [0,1], so that you instead consider the function F(\theta_1 x_1, \ldots, \theta_N x_N). That is, we’ve added weights that reduce the contribution of each dimension simply by multiplying by a constant in [0,1]. This is one of the most basic things you’ll learn in Machine Learning, and the rate of change in accuracy as a function of each \theta_i will tell you how important each dimension is to the prediction function. This is basically what Fisher did, except almost one hundred years ago, effectively discovering a fundamental tool of Machine Learning.
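To make this concrete, here is a minimal Python sketch of the weighting idea, using the Iris Dataset and an off-the-shelf classifier as stand-ins for the prediction function F; the model choice and step size are assumptions for illustration, not anyone’s actual pipeline.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a simple prediction function F on the Iris Dataset.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

def accuracy_with_weights(theta):
    # Accuracy of F(theta_1 * x_1, ..., theta_N * x_N) on the training data.
    return model.score(X * theta, y)

# Rate of change of accuracy when theta_i is reduced from 1
# (a coarse finite-difference estimate, using a large step so the effect is visible).
eps = 0.5
baseline = accuracy_with_weights(np.ones(X.shape[1]))
for i in range(X.shape[1]):
    theta = np.ones(X.shape[1])
    theta[i] -= eps  # shrink the contribution of dimension i
    sensitivity = (baseline - accuracy_with_weights(theta)) / eps
    print(f"dimension {i}: approximate sensitivity {sensitivity:.3f}")

Dimensions whose sensitivity is near zero contribute little to the prediction function, which is the sense in which this procedure tests the relevance of each dimension.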

The point is more than just historical: I think Machine Learning is a buzzword that obscures the fact that a lot of this material was known a long time ago, that Artificial Intelligence is, generally speaking, far more advanced than the public realizes, and that, as a matter of logical implication, most of what we believe to be new and exciting breakthroughs are often mundane adaptations of existing methods and technology. The fact that so much money is being poured into the market is disturbing, because I have no idea what these people do all day.

Corrections to the Proof of Cantor’s Theorem

I was reading my absolute favorite book on mathematics, Mathematical Problems and Proofs, and it mentions Cantor’s Theorem, that the cardinality of a set is always less than the cardinality of its power set. So for example, |\{1,2\}| = 2, which is less than |\{ \emptyset, \{1\}, \{2\}, \{1,2\}\}| = 4. There is however no proof in that book of the general case, only a proof of the finite case, expressed as a counting argument. I looked up proofs, and all the proofs I could find seem to have the same hole, which I’ll discuss.
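The finite case is easy to verify directly; here is a minimal Python sketch (mine, not the book’s) that enumerates every function from a small set A into its power set and confirms that none of them is surjective.

from itertools import chain, combinations, product

def power_set(A):
    # All subsets of A, as frozensets.
    A = list(A)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(A, r) for r in range(len(A) + 1))]

A = {1, 2}
PA = power_set(A)

# A function psi: A -> P(A) is just a choice of an image for each element of A.
surjection_found = any(
    set(images) == set(PA)
    for images in product(PA, repeat=len(A))
)
print("surjection from A onto P(A) exists:", surjection_found)  # prints False

Since |P(A)| = 2^{|A|} > |A| for any finite A, the same outcome holds for every finite set, which is the counting argument the book gives.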

Let A be a set, and let P(A) denote the power set of A, i.e., the set of all of its subsets. Now assume that |A| \geq |P(A)|. That is, we are assuming arguendo that the cardinality of A is not less than the cardinality of its power set, in contradiction to Cantor’s Theorem. It follows that there must be some function \psi: A \rightarrow P(A) such that \psi is surjective, which means that all elements of P(A) are mapped to by \psi, and such a function must exist, because we have assumed that |A| \geq |P(A)|. That is, because |A| \geq |P(A)|, there are enough elements in A to map to every element in P(A).

Now the next step of the proof (at least in the few versions I saw this morning) is universally to define the set B below, without addressing the possibility that the set is empty, though it’s possible the original proof addresses this case. My issue with this is that there doesn’t seem to be any difference between an empty set and a set that is provably non-existent. As such, I don’t think empty sets should be used as the basis of a proof, unless the proof follows only from the non-existence of the set, together with other true theorems, true axioms, or additional assumptions (in the case of a line of reasoning being considered arguendo). In this case, proving the set in question is non-empty changes the scope of the proof, in that it only applies to sets with a cardinality of two or greater. Most importantly, and consistent with this, the accepted proof fails in the case of the power set of the empty set, and in the case of the power set of a singleton, because of this hole. This is very serious: the proof is literally wrong, and fails in two cases that are plainly intended to be covered by it.

The Modified Proof

Specifically, proofs of Cantor’s theorem define B = \{x \in A | x \notin \psi(x)\}. That is, because we know \psi (defined above) must exist, we know that we can define B, using \psi. However, it’s possible that B is empty, since there might not be any such x \in A. That said, a simple additional step will prove that we can always define \hat{\psi}, which will produce a non-empty set \hat{B}.

Assume that B is empty and that |A| \geq 2. Because A \subset P(A), there must be x,y \in P(A), such that x \neq y, and x,y \in A. Because \psi is surjective, there must be a,b such that \psi(a) = x and \psi(b) = y. If a \notin x or b \notin y, then B is non-empty. As such, assume that a \in x and b \in y. Now define \hat{\psi}, such that \hat{\psi}(x) = y and \hat{\psi}(y) = x, but otherwise \hat{\psi}(z) = \psi(z) for z \notin \{x,y\}. If x and y are both singletons, then x \notin y and y \notin x. If either is not a singleton (or both are not singletons), then it must be the case that either x \notin y or y \notin x, or both. This implies that \hat{B} is not empty. Because this can be done for any surjective function \psi, we are always guaranteed a non-empty set \hat{B}. Note that \hat{\psi} is still surjective.

Now we can complete the proof as it is usually stated. Because \hat{\psi} is surjective, there must be some x_0 such that \hat{\psi}(x_0) = \hat{B} \in P(A). It must be the case that either x_0 \in \hat{B} or x_0 \notin \hat{B}. If x_0 \in \hat{B}, then x_0 fails the criterion for inclusion in \hat{B}, namely \hat{B} = \{x \in A | x \notin \hat{\psi}(x)\}, since by definition, \hat{\psi}(x_0) = \hat{B}. If x_0 \notin \hat{B}, then x_0 satisfies the criterion for inclusion in \hat{B}. In both cases, we have a contradiction. The assumption that |A| \geq |P(A)| implies the existence of \psi, which in turn implies the existence of \hat{\psi} and \hat{B}, which leads to a contradiction. To resolve this contradiction, we must therefore conclude that |A| < |P(A)|, which completes the proof.

The Case of the Empty Set

Now assume that A = \emptyset, and let’s apply the proof above. Generally speaking, we would assume that the power set of the empty set contains one element, namely a set that contains the empty set as a singleton, represented as P(A) = \{\{\emptyset\}\}. Assuming we can even define \psi in this case, it must be that \psi(\emptyset) = \{\emptyset\}. Because \emptyset \in \{\emptyset\}, it must be the case that B = \emptyset. However, if we want A \neq P(A), it must be the case that \emptyset \neq \{\emptyset\}, even though \emptyset \in \{\emptyset\}. Therefore, \emptyset \notin P(A), and instead, \emptyset \in \{\emptyset\}, and as a result, the proof fails in the case of |A| = 0 since B = \emptyset \notin P(A). That is, the accepted proof assumes B is contained in the power set, which is just not true in this case, suggesting that there really are problems deriving theorems from empty sets.

The Case of a Singleton

Now assume that A = \{x\}, and let’s apply the proof above. The power set of a singleton is generally defined as P(A) = \{ \{\emptyset\}, \{x\}\}. Because the accepted proof does not explicitly define \psi, we are free to define \psi(x) = \{x\}. Because x \in \{x\}, B = \emptyset. As noted above, \emptyset \neq \{\emptyset\}, and therefore, B \notin P(A). Again, the accepted proof fails.

Because the accepted proof fails in two cases where B is empty, my axiom above, requiring sets to be non-empty in order to derive theorems, must be true. Nothing else within the proof above is subject to meaningful criticism. Again, I have no idea whether Cantor’s original proof addressed these points, but it’s surprising that no one has bothered to apply the accepted proof to the case of the empty set and of a singleton, which would make it clear that it doesn’t work.

Information and the Continuum Hypothesis

Years ago when I was living in Sweden, I discovered what seems to be an original number I called \Omega, which is defined as the logarithm of \aleph_0, i.e., \Omega = \log(\aleph_0). You can read more about it in this paper [1], where I present a few theorems etc., and introduce a machine that actually has storage equal to \Omega bits. The machine would require infinite time to construct, but so does a true UTM, which has an infinite tape. Similarly, the machine described in [1] has countably many states, and therefore, \Omega is the amount of information that can be stored on any machine that has countably many states. As a consequence, \Omega has intuitive physical meaning, and therefore, we have a reason to assume it exists.

I was thinking about the number again while food shopping, and I realized it fits the continuum hypothesis perfectly. I show in [1] that there is no set with a cardinality of \Omega, and that as a logical consequence, \Omega must be something other than a number. The natural answer is that \Omega is a quantity of information. We can think about zero the same way, as 0 = \log(1), which satisfies our intuition, that zero represents no substance. Further, in this case, zero is the amount of information contained in a system that has exactly one state. This is plainly true, since a system that never changes cannot meaningfully store information, and instead simply exists. Similarly, \Omega is not the cardinality of any set, and therefore not a number, but it is nonetheless a meaningful quantity of information. As a consequence, it’s arguable we cannot perform a number of operations a number of times equal to \Omega (see below for more on this topic).

All of that said, \Omega^{\aleph_0} is defined, because that is simply the product \Omega^{\aleph_0} = \Omega \cdot \Omega \cdot \Omega \ldots, taken a countable number of times. Basic assumptions of arithmetic imply the following inequality:

2^{\aleph_0} \leq \Omega^{\aleph_0} \leq \aleph_0^{\aleph_0} = 2^{\aleph_0}.

Because 2^{\aleph_0} = \aleph_1 under the continuum hypothesis, it must be the case that \Omega^{\aleph_0} = \aleph_1.

However, taking the logarithm, we have the following:

\aleph_0 \leq \aleph_0 \log(\Omega) \leq \aleph_0 \Omega.

Corollary 1.9 of [1] proves that \Omega is unique, in that there is no other number of bits X \neq \Omega such that X > n for all n \in \mathbb{N}, and X < \aleph_0. Therefore, \log(\Omega) does not exist. As a result, we learn nothing from taking the logarithm, since \Omega is the smallest infinite quantity. However, we can conceive of a countable collection of machines with \Omega bits of storage, and you can therefore easily show that \aleph_0 \cdot \Omega = \aleph_0.

Most interestingly, it must be the case that,

\aleph_0 \leq \aleph_0^{\Omega} \leq \aleph_1.

Therefore, because of the uniqueness of \Omega, if the continuum hypothesis is true, the above inequality is the unique solution. Note that the continuum hypothesis has been proven independent of the standard axioms of set theory. Further, note that this would arguably require assuming we can perform an operation \Omega times, which is not an intuitive assumption. You could argue that we already permit products and powers of real numbers, which are plainly not cardinals, but we can ascribe physical meaning to such operations through approximation by finite decimals, which are ratios of natural numbers. In contrast, \Omega has no approximation.

A few other thoughts: n\Omega = \Omega for all n \in \mathbb{N}, and you can appreciate this by simply reading [1], even though it’s not proven there. However, I realized tonight that it’s arguable \Omega^2 is not defined. In contrast, 2\Omega is just the number of bits contained in two machines, and more generally, n\Omega is simply the number of bits contained in n machines. It’s not clear how to think about a collection of machines \Omega in number, precisely because there is no set that contains \Omega elements. Therefore, integer powers of \Omega are arguably undefined. However, as noted above, we can conceive of a countable collection of machines, and you can easily show that \aleph_0 \cdot \Omega = \aleph_0. I’m nearly certain I unpacked all of this in Sweden, in my notes, but I didn’t include it in [1], because I was working on completing Black Tree AutoML. That said, I discovered this material around 2018, and I doubt anyone has done any work on it in the interim.

Bilateral Comparison of Ancestry Flows

In a previous note, I presented an algorithm that allows you to test ancestry flows between three populations. The method in the previous note already allows for bilateral comparisons between the three populations. However, if you repeatedly apply the trilateral test, using all possible triplets from a dataset of populations, you will produce a graph. This graph will have bilateral flows derived from information in the entire dataset, as opposed to just three populations.
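To make the aggregation step concrete, here is a minimal Python sketch of the idea; this is not the actual code from the linked paper, and the trilateral_test helper and its return format are assumptions for illustration only.

from itertools import combinations
import numpy as np

def build_full_ancestry_graph(num_populations, trilateral_test):
    # Aggregate pairwise ancestry flows over every triplet of populations.
    # trilateral_test(i, j, k) is assumed to return a dict mapping ordered
    # pairs (a, b) to an estimated flow from population a to population b.
    graph_matrix = np.zeros((num_populations, num_populations))
    for i, j, k in combinations(range(num_populations), 3):
        for (a, b), flow in trilateral_test(i, j, k).items():
            graph_matrix[a, b] += flow  # accumulate evidence from this triplet
    return graph_matrix

Because every pair of populations appears in many triplets, the resulting bilateral flows reflect information from the entire dataset rather than any single triplet.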

I’m still working through the data, but in the previous note, I stated that Norway seems to be the root population for basically everyone. I now think the root is somewhere around Holland and Denmark, using the full graph produced by the algorithm below. This is not to undermine the hypothesis that human life began in Africa; instead, the hypothesis is that modern Homo sapiens seem to have emerged fairly recently, in Northern Europe. All of this needs to be squared with other known results, in particular archeological results, but it does explain how, e.g., Southeast Asians have the gene for light skin, by descent (i.e., they’re the descendants of white Europeans). I’m merely expanding the claim, pointing out that a lot of Africans also test as the descendants of Europeans, using mtDNA.

Here’s the code, with more to come. You just call it from the command line: graph_matrix = generate_full_ancestry_graph(dataset, num_classes, N); The resultant graph flows are stored in graph_matrix. You’ll need the rest of the code, which is included in the paper I link to in the previous note.

mtDNA and IQ

Introduction

I’ve noticed in the past that Finns have significantly higher IQs than the Swedes and Norwegians. This is, in my opinion, the group of people to study if you’re interested in the nature of intelligence, because they’re all very similar people, from roughly equally rich nations, in the same part of the world, which should allow innate ability to take control. One notable difference is that the Finns speak a Uralic language, whereas the Norwegians and Swedes speak a Germanic language. There could be something to this, but investigating the problem again today led me to what seems an inescapable conclusion: whatever the connection is between mtDNA and intelligence, it simply cannot account for the distribution of IQ as it exists.

Instead, I now believe that brain structure is the most important factor in intelligence, and it simply cannot be controlled by mtDNA in any credible way. Specifically, my thinking is rooted in algorithmic complexity: if you have two equally powered machines, running different algorithms that accomplish the same task, then the machine with the more efficient algorithm will be the more powerful of the two. Translated to biology, if you have two brains that both consume the same amount of power per unit of time, and have the same “clock rate”, one brain could still be vastly more powerful than the other, due simply to different structure. This could explain, e.g., the fact that some birds can talk, whereas some dogs will eat until they vomit, despite the fact that birds have brain volumes that are a small fraction of a dog’s brain volume.

mtDNA and Intelligence

Despite the apparent complexity of the subject, this is going to be a short note, because the idea that mtDNA determines IQ is apparently nonsense, despite the scholarship on the topic (not picking on anyone, but here’s a decent article that runs through some credible arguments for the role of mtDNA in intelligence). But as you’ll see, whole-genome sequencing throws the argument in the garbage.

There’s no nice way to say this, but the Roma people have pretty low IQs. What’s most interesting about them, though, is that they are basically identical to each other, and to all other people of that maternal line, including about 100% of Papuans, 67% of Russians, and about 30% of Taiwanese people. If you want to test the results yourself, you can see my paper, “A New Model of Computational Genomics” [1], which includes all the software, and a detailed walkthrough to explain how I end up with these numbers. At a high level, the Papuans, Russians, and Taiwanese people in this group of Roma lineage are all a 99% match to the Iberian Roma, with respect to their mtDNA. If mtDNA controlled intelligence, then all of those populations should have similarly low IQs, since they’re basically identical to the Roma. This is just not true: the Taiwanese have roughly the highest or second highest national IQ on Earth, and the Russians have roughly the same IQ as the Norwegians and Swedes, despite the fact that Russia is, quite frankly, poor and dysfunctional compared to Norway and Sweden.

One important note: though you’ll often hear that “humans are 98% monkey”, or some nonsense like that, the algorithms in [1] use what’s called a global alignment, and as a consequence, they’re extremely sensitive to changes in position, causing, e.g., the Roma to have little more than chance in common with some people (i.e., about 25% of the mtDNA bases). This sensitivity is probably why the software in [1] is so powerful, and is able to predict ethnicity with about 80% accuracy using mtDNA alone (which is pretty amazing). In contrast, NIH’s BLAST algorithm uses a local alignment, and so it deliberately seeks to maximize the number of matching bases by shifting two genomes around, causing everyone to look the same, and therefore throwing away valuable information about the genome.
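To illustrate the difference in sensitivity, here is a minimal Python sketch; it is not the software from [1], and the shifted comparison is only a crude stand-in for what local alignment actually does. It contrasts a strict position-by-position comparison with one that is allowed to shift a sequence to maximize agreement.

def positionwise_match(a, b):
    # Fraction of positions at which two equal-length sequences agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_shifted_match(a, b, max_shift=5):
    # Best positionwise match over small shifts of b; since shift 0 is included,
    # this is always at least positionwise_match(a, b), which is the sense in
    # which alignment by shifting makes sequences look more similar.
    best = 0.0
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            overlap_a, overlap_b = a[s:], b[:len(b) - s]
        else:
            overlap_a, overlap_b = a[:len(a) + s], b[-s:]
        if overlap_a:
            best = max(best, positionwise_match(overlap_a, overlap_b))
    return best

print(positionwise_match("GATTACA", "ATTACAG"))  # about 0.14: the positions barely line up
print(best_shifted_match("GATTACA", "ATTACAG"))  # 1.0: a shift of one aligns the overlap perfectly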

Getting back to the core topic, if you pay attention to this limited set of facts, mtDNA is in the garbage as a driver of intelligence, and moreover, the role of poverty is not exactly clear either, since Russia is really poor compared to Norway and Sweden, and yet they have roughly the same IQs. So what is driving this? Cynically, I think IQ testing is really just testing for basic education (when you look at a map), which is absent in the truly poorest countries, but that doesn’t mean that we can’t debunk the connection between mtDNA and intelligence. But to be clear, I do think intelligence is genetic, and in anomalous cases like Finland, Cambodia, and Suriname, IQ becomes something interesting, because it’s at least a test. I just doubt it’s mtDNA driving the bus.

Some Answers from Computer Science

Even if we posit arguendo (which is not very nice) that there’s something wrong with Roma mtDNA, this would simply imply that they have low energy per unit of time, perhaps as a function of fixed caloric intake and environment. To make this less abstract, let’s fix a Norwegian guy (not Roma) and a Russian guy (Roma), and give them the same food, education, climate, environment, clothes, etc., over a lifetime. Under this assumption, the Russian guy will produce less energy over his lifetime, and therefore, his brain has a lower output. But this is garbage as an argument, for mechanical reasons: if the Russian guy has a more efficient brain, then he doesn’t need as much power to run his brain. As a consequence, his output over a lifetime could in fact be higher.

To make things completely concrete, if you use a brute force method to sort a list of 10 letters, you may have to perform 10! = 3,628,800 calculations. If you instead use my parallel method, you’ll have to make between 3 and 4 calculations. As you can plainly see, there is an ocean between these two approaches to solving even the simple problem of sorting a list. As a consequence, the most sensible answer is, in my opinion, that brain structure controls intelligence, for the simple reason that it encodes the algorithms we use to solve the problems we face every day. Some people have fast ones, some people have dumb ones, and then there’s (probably) most people in the middle.
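To see the size of that gap in practice, here is a minimal Python sketch; it only illustrates the arithmetic of brute-force search over orderings, and is not the parallel method mentioned above.

from itertools import permutations
import math

# The arithmetic in the text: brute-force sorting of 10 letters by testing
# orderings can require examining up to 10! of them.
print(math.factorial(10))  # 3628800

# A small demonstration with 8 letters so the loop finishes quickly.
letters = list("HGFEDCBA")
target = sorted(letters)
for count, p in enumerate(permutations(letters), start=1):
    if list(p) == target:
        print("brute force needed", count, "of", math.factorial(len(letters)), "orderings")
        break

# An ordinary comparison sort handles the same list with a handful of comparisons.
print(sorted(letters))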

Returning to the birds versus dogs analogy, I think it’s not ridiculous to argue that birds have vastly more efficient brains than dogs, and that something along the lines of computational efficiency is taking place in the brain of a bird, which allows it to perform complex tasks with a smaller, presumably lower-energy brain. For the same reasons, this could explain the obvious fact that some people are wildly more intelligent than others, despite (possibly) having the same maternal line. Because intelligence varies within a given ethnicity, I can tell you that you are, e.g., Norwegian, with high accuracy using just your mtDNA, but there’s no way of knowing (to my knowledge) whether you’re one of the dumb ones. This doesn’t preclude identifying deficiencies in mtDNA that will make you dangerously ill, and therefore not very bright, but it just doesn’t make sense that the means of power production controls the most complex structure in the Universe –

It’s a single bean, in an ocean of genetic information.