Measuring Uncertainty in Ancestry

In my paper, A New Model of Computational Genomics [1], I presented an algorithm that can test whether one mtDNA genome is the common ancestor of two other mtDNA genomes. The basic theory underlying the algorithm is straightforward, and cannot be argued with:

Given genomes A, B, and C, if genome A is the ancestor of genomes B and C, then it must be the case that genomes A and B, and A and C, have more bases in common than genomes B and C. This is a relatively simple fact of mathematics, that you can find in [1], specifically, in footnote 16. However, you can appreciate the intuition right away: imagine two people tossing coins simultaneously, and writing down the outcomes. Whatever outcomes they have in common (e.g., both throwing heads), will be the result of chance. For the same reason, if you start with genome A, and you allow it to mutate over time, producing genomes B and C, whatever bases genomes B and C have in common will be the result of chance, and as such, they should both mutate away from genome A, rather than developing more bases in common with each other by chance. This will produce the inequalities |AB| > |BC| and |AC| > |BC|, where |AB| denotes the number of bases genomes A and B have in common.

For the same reason, if you count the number of matches between two populations at a fixed percentage of the genome, the match counts between populations A, B, and C, should satisfy the same inequalities, for the same reason. For example, fix the matching threshold to 30% of the full genome, and then count the number of genomes between populations A and B that are at least a 30% match or more to each other. Do the same for A and C, and B and C. However, you’ll have to normalize this to an [0,1] scale, otherwise your calculations will be skewed by population size. My software already does this, so there’s nothing to do on that front.

By iteratively applying the population-level test for different values of M, we can also generate a measure of uncertainty associated with our observation. That is, not only can we test whether the inequalities are satisfied, we can also generate a measure of uncertainty associated with the test.

Specifically, fix M to some minimum value, which we select as 30% of the full genome size N, given that 25% is the expected matching percentage produced by chance, and 30% is meaningfully far from chance (again, see Footnote 16 of [1]). Further, note that as M increases, our confidence that the matches between A and B and A and C, are not the result of chance, increases. For intuition, note that as we increase M, the set of matching genomes can only grow smaller. Similarly, our confidence that the non-matching genomes between B and C will not be the result of chance decreases as a function of M. For intuition, note that as we increase M, the set of non-matching genomes can only grow larger.

As a result the minimum value for which the inequalities are satisfied informs our confidence in the B to C test, and the maximum value of M for which the inequalities are satisfied informs our confidence in the A to B and A to C tests. Specifically, the probability the B to C test is the result of chance is informed by the difference between the minimum M – N25%, whereas the A to B and A to C tests are informed by the difference N – M, where M is the maximum M. Note this difference is literally some number of bases, that is in turn associated with a probability (see again, Footnote 16 in [1]), and a measure of Uncertainty (see Section 3.1 of [1]). This allows us to first test whether or not a given population is the common ancestor of two other populations, and then further, assign a value of Uncertainty to that test.

Phoenicians as Common Ancestor

In a previous article, I showed that the people of Cameroon test as the ancestors of Heidelbergensis, Neanderthals, and Denisovans, with respect to their mtDNA. The obvious question is, how is it that archaic humans are still alive today? The answer is that they’re probably not truly archaic humans, but that their mtDNA is truly archaic. This is possible for the simple reason that mtDNA is remarkably stable, and can last for thousands of years without changing much at all. However, there’s still the question of where modern humans come from, i.e., is there a group of people that test as the common ancestors of modern human populations. The answer is yes, and it’s the Phoenicians, in particular, a group of mtDNA genomes found in Puig des Molins. Astonishingly, the Phoenicians test as the common ancestor of the Pre-Roman Egyptians (perhaps not terribly astonishing), and the modern day Thai and Sri Lankans, the latter two being simply incredible, and perhaps requiring a reconsideration of purported history.

The overall test is straight forward, and cannot be argued with: Given genomes A, B, and C, if genome A is the ancestor of genomes B and C, then it must be the case that genomes A and B, and A and C, have more bases in common than genomes B and C. This is a relatively simple fact of mathematics, that you can find in my paper, A New Model of Computational Genomics [1], specifically, in footnote 16. However, you can appreciate the intuition right away: imagine two people tossing coins simultaneously, and writing down the outcomes. Whatever outcomes they have in common (e.g., both throwing heads), will be the result of chance. For the same reason, if you start with genome A, and you allow it to mutate over time, producing genomes B and C, whatever bases genomes B and C have in common will be the result of chance, and as such, they should both mutate away from genome A, rather than developing more bases in common with each other by chance. This will produce the inequalities |AB| > |BC| and |AC| > |BC|, where |AB| denotes the number of bases genomes A and B have in common.

For the same reason, if you count the number of matches between two populations at a fixed percentage of the genome, the match counts between populations A, B, and C, should satisfy the same inequalities, for the same reason. For example, fix the matching threshold to 30% of the full genome, and then count the number of genomes between populations A and B that are at least a 30% match or more to each other. Do the same for A and C, and B and C. However, you’ll have to normalize this to an [0,1] scale, otherwise your calculations will be skewed by population size. My software already does this, so there’s nothing to do on that front.

In this case, I’ve run several tests, all of which use the second population-level method described above. We begin by showing that the Phoenicians are the common ancestor of the modern day Sri Lankans and Sardinians. For this, set the minimum match count to 99.65% of the full genome size. This will produce a normalized score of 0.833 between the Phoenicians and Sri Lankans, and 0.800 between the Phoenicians and Sardinians. However, the score between the Sri Lankans and the Sardinians is 0.200, which plainly satisfies the inequality. This is consistent with the hypothesis that the Phoenician maternal line is the ancestor of both the modern day Sri Lankans and Sardinians. Setting the minimum match count to 88.01% of the genome, we find that the score between the Phoenicians and the Pre-Roman Egyptians is 0.500, and the score between the Phoenicians and the Sri Lankans is 1.000. The score between the Pre-Roman Egyptians and the Sri Lankans is instead 0.000, again satisfying the inequality. This is consistent with the hypothesis that the Phoenicians are the common ancestor of both the Pre-Roman Egyptians and the modern day Sri Lankans.

This seems peculiar, since the Phoenicians are Middle Eastern people, and the genomes in question are from Ibiza. However, the Phoenicians in particular were certainly sea-faring people, and moreover, civilization in the Middle East goes back to at least Ugarit, which could date as far back as 6,000 BC. Though not consistent with purported history, this at least leaves open the possibility that people from the Middle East traveled to South Asia. This might sound too ambitious for the time, but the Phoenicians made it to Ibiza from the Middle East, which is roughly the same distance as the Middle East to Sri Lanka, both of which are islands. Once you’re in South Asia, the rest of the region becomes accessible.

If this is true, then it shouldn’t be limited to Sri Lanka, and this is in fact the case. In particular, the Thai also test as the descendants of the Phoenicians, using the same analysis. Even more interesting, both the modern day Norwegians, Swedes, and Finns test as the descendants of the Thai, again using the same analysis. Putting it all together, it seems plausible that early Middle Eastern civilizations not only visited but settled South Asia, and that some of them came back, in particular to Egypt, and Scandinavia. This could explain why the Pre-Roman Egyptians are visibly Asian people, and further, why Thai-style architecture exists in early Scandinavia. Though the latter might sound totally implausible, it is important to note that some Thai and Norwegian people are nearly identical on the maternal line, with about 99.6% of the genome matching. Something has to explain that. Also note that the Sri Lankan maternal line was present throughout Europe around 33,000 BC. This suggests plainly that many Europeans, and the Classical World itself, descend from the Phoenicians. That somewhat remote populations also descend from them is not too surprising, in this context.

Further, there are alarming similarities between the Nordic religions and alphabet, and the Canaanite religions and alphabet, in particular, the gods El / Adon and Odin, with their sons, Baal and Baldur, respectively. Once you place greater emphasis on genetic history, over written history, this story sounds perfectly believable. Further still, if people migrated back from South Asia to the West, then this should again not be limited to Scandinavia, and this is in fact the case. Astonishingly, the Pre-Roman Egyptians test as the descendants of the Thai people, using the same analysis. Obviously the Pre-Roman Egyptians were not the first Africans, and in fact, everything suggests they’re South Asian, and for the same reason, none of this implies that modern day Scandinavians are the first Scandinavians, and instead, again, it looks like many Norwegians and Finns are instead, again, South Asian.

Finally, this is all consistent with the obvious fact that the most advanced civilizations in the world, i.e., the Classical World, are all proximate to the Middle East, suggesting that the genesis of true human intelligence, could have come from somewhere near Phoenicia.

On the improbability of reproductive selection for drastic evolution

Large leaps in evolution seem to require too much time to make sense. Consider the fact that about 500 bases separate human mtDNA from that of a gorilla or a chimp. That’s a small percentage of the approximately 16,000 bases that make up human mtDNA, but the number of sequences that are 500 bases in length is 4^ 500, which has approximately 300 digits. As a consequence, claiming that reproductive selection, i.e., the birth of some large number of children, that were then selected for fitness by their environment, is the driver of the change from ape to man, makes no sense, as there’s simply not enough time or offspring for that to be a credible theory, for even this small piece of its machinery, which is the evolution of mtDNA.

However, if we allow for evolution at the cellular level in the individual, over the lifetime of the individual, then it could explain how e.g., 500 extra bases end up added to the mtDNA of a gorilla, since there are trillions of cells in humans. That is, floating bases are added constantly, as insertions, in error, and when lethal, the cell in question dies off. However, if not lethal, and instead beneficial, this could occur throughout the body of the organism, causing the organism to evolve within its own lifetime, by e.g., changing its mtDNA through such a large scale, presumably beneficial insertion, like the one that divides apes from humanity.

This implies four corollaries:

1. It is far more likely that any such benefits will be passed on from the paternal line, since men constantly produce new semen. In contrast, women produce some fixed number of eggs by a particular age. As a result, men present more opportunities to pass down mutations of this type, if those mutations also impact their semen.

2. There must be some women who are capable of producing “new eggs” after a mutation, otherwise the mutation that caused gorilla mtDNA to evolve into human mtDNA, wouldn’t persist.

3. If you argue instead that such drastic mutations occur in the semen or the eggs, then you again have the problem of requiring too much time, since it would require a large number of offspring, that are then selected for lethal and non-lethal traits. This is the same argument we dismissed above. That is, the number of possible 500 base insertions is too large for this to be a credible theory. As a consequence, drastic mutations cannot be the result of reproductive selection, period, and require another explanation, for which cellular mutations within the individual seem a credible candidate.

4. If true, then it implies the astonishing possibility of evolution within the lifetime of an individual. This sounds far fetched, but cancer is a reality, and is a failure at the cellular level, that causes unchecked growth. The argument above implies something similar, but beneficial, that occurs during the lifetime of an individual, permeating its body, and thereby impacting its offspring.

Denisovan as Common Ancestor, Revisited

In a previous note, I showed that the Denisovans appear to be the common ancestor of both Heidelbergensis and Neanderthals, in turn implying that they are the first humans. Since writing that note, I’ve expanded the dataset significantly, and it now includes the people of Cameroon. I noticed a while back that the people of Cameroon are plainly of Denisovan ancestry. Because it’s commonly accepted that humanity originated in Africa, the Cameroon are therefore a decent candidate for being related to the first humans.

It turns out, when you test Cameroon mtDNA, it seems like they’re not only related to the first humans, they are in fact the first humans, and test as the ancestors of the Denisovans, Heidelbergensis, and the Neanderthals. You might ask how it’s possible that archaic humans survived this long. The answer is, mtDNA is remarkably stable, and so while the people of Cameroon are almost certainly not a perfect match to the first humans, it seems their mtDNA could be really close, since they predate all the major categories of archaic humans with respect to their mtDNA.

The overall test is straight forward, and cannot be argued with: Given genomes A,B, and C, if genome A is the ancestor of genomes B and C, then it must be the case that genomes A and B, and A and C, have more bases in common than genomes B and C. This is a relatively simple fact of mathematics, that you can find in my paper, A New Model of Computational Genomics [1], specifically, in footnote 16. However, you can appreciate the intuition right away: imagine two people tossing coins simultaneously, and writing down the outcomes. Whatever outcomes they have in common (e.g., both throwing heads), will be the result of chance. For the same reason, if you start with genome A, and you allow it to mutate over time, producing genomes B and C, whatever bases genomes B and C have in common will be the result of chance, and as such, they should both mutate away from genome A, rather than developing more bases in common with each other by chance. This will produce the inequalities |AB| > |BC| and |AC| > |BC|, where |AB| denotes the number of bases genomes A and B have in common.

For the same reason, if you count the number of matches between two populations at a fixed percentage of the genome, the match counts between populations A, B, and C, should satisfy the same inequalities, for the same reason. For example, fix the matching threshold to 30%, and then count the number of genomes between populations A and B that are at least a 30% match or more to each other. Do the same for A and C, and B and C. However, you’ll have to normalize this to an [0,1] scale, otherwise your calculations will be skewed by population size. My software already does this, so there’s nothing to do on that front.

If it is the case that populations B and C evolved from population A, then the number of matches between A and B and A and C, should exceed the number of matches between B and C. The mathematics is not as obvious in this case, since you’re counting matching genomes, rather than matching bases, but the intuition is the same. Just imagine beginning with population A, and replicating it in populations B and C. In this initial state, the number of matching genomes between A and B, A and C, and B and C, are equal, since they’ve yet to mutate away from A (i.e., they are all literally the same population). As populations B and C mutate, the number of matching genomes between B and C should only go down as a function of time, since the contrary would require an increase in the number of matching bases between the various genomes, which is not possible at any appreciable scale. Again, see [1] for details.

In the first note linked to above, I show that the Denisovans are arguably the common ancestors of both Heidelbergensis and the Neanderthals. However, if you use the same code to test the Cameroon, you’ll find that they test as the common ancestor of the Denisovans, Heidelbergensis, and the Neanderthals. This is just not true of other populations that are related to Denisovans. For example, I tested the Kenyans, the Finns, and the Mongolians, all of which have living Denisovans in their populations (at least with respect to their mtDNA) and they all fail the inequalities. Now, there could be some other group of people that are even more archaic than the Cameroon, but the bottom line is, this result is perfectly consistent with the notion that humans originated in Africa, migrated to Asia, and then came back to both Europe and Africa, since e.g., about 10% of Kenyans are a 99% match to South Koreans and Hawaiians, and the Pre-Roman Ancient Egyptians were visibly Asian people, and about 40% of South Koreans are a 99% match to the Pre-Roman Ancient Egyptians.

The updated dataset that includes the Cameroons, and others, is available here. You’ll have to update the command line code in [1] to include the additional ethnicities, but it’s a simple copy / paste exercise, which you’ll have to do anyway to change the directories to match where you save the data on your machine.