Introduction
In a previous post, I shared what I thought was a clever way of testing for ancestry, that turned out to be a failure empirically. I now understand why it doesn’t work, and it’s because I failed to consider an alternative hypothesis that is consistent with the purported facts. This produced a lot of bad data. I’ll begin by explaining how the underlying algorithmic test for ancestry works, and then explain why this instance of it failed, and close by introducing yet another test for ancestry that plainly works, and is simply amazing, allowing us to mechanically uncover the full history of mankind, using mtDNA alone.
Algorithmic Testing for Ancestry
Assume you’re given whole mtDNA genomes A, B, and C. The goal is to test whether genome A is the ancestor of both genomes B and C. It turns out, this is straight forward as a necessary (but not sufficient condition) for ancestry. Specifically, if we begin with genome A, and then posit that genomes B and C mutated independently away from genome A (e.g., groups B and C travelled to two distinct locations away from group A), then it is almost certainly the case that genomes B and C have fewer bases in common with each other, than they have in common with genome A.
For intuition, because we’ve assumed genomes B and C are mutating independently, the bases that mutate in each of B and C are analogous to two independent coins being tossed. Each mutation will reduce the number of bases in common with genome A. For example, if genome B mutates, then the number of bases that A and B have in common will be reduced. Note we are assuming genome A is static. Because B and C are mutating independently, it’s basically impossible for the number of bases in common between B and C to increase over time. Further, the rate of the decrease in common bases is almost certainly going to be higher between B and C, than between A and B, and A and C. For example, if there are 10 mutations in each of genomes B and C (i.e., a total of 20 mutations combined), then the match counts between A and B and A and C, will both decrease by exactly 10, whereas the match count between B and C should decrease by approximately 20. Let |AB| denote the match count between genomes A and B. We have then the following inequalities:
Case 1: If genome A is the common ancestor of both genomes B and C, then it is almost certainly the case that |AB| > |BC| and |AC| > |BC|. See, “A New Model of Computational Genomics” [1] for further details.
Even though this is only a necessary condition for ancestry, this pair of inequalities (coupled with a lot of research and other techniques), allowed me to put together a complete, and plausible, history of mankind [2], all the way back to the first humans in Africa.
Ancestry from Archaic Genomes
The simple insight I had, was that if A is not archaic, and B is archaic, then A can’t credibly be the ancestor of B. That is, you can’t plausibly argue that a modern human is the ancestor of some archaic human, absent compelling evidence. Further, it turns out the inequality (since it is a necessary but not sufficient condition) is also consistent with linear ancestry in two cases. Specifically, if |AB| > |BC| and |AC| > |BC|, then we can interpret this as consistent with –
Case 2: B is the ancestor of A, who is in turn the ancestor of C.
Case 3: C is the ancestor of A, who is in turn the ancestor of B.
If you plug in A = Phoenician, B = Heidelbergensis, and C = Ancient Egypt, you’ll find the inequality is satisfied for 100% of the applicable genomes in the dataset. Note that the dataset is linked to in [1]. It turns out you simply cannot tell what direction time is running given the genomes alone (unless there’s some trick I’ve missed), and so all of these claims are subject to falsification, just like science is generally. That said, if you read [2], you’ll see fairly compelling arguments consistent with common sense, that Heidelbergensis (which is an archaic human), is the ancestor of the Phoenicians, who are in turn the ancestors of the Ancient Egyptians. This is consistent with case (2) above.
Putting it all together, we have a powerful necessary condition that is consistent with ancestry, but not a sufficient condition, and it is therefore subject to falsification. However, one of these three cases is almost certainly true, if the inequalities are satisfied. The only question is which one, and as far as I can tell, you cannot determine which case is true, without exogenous information (e.g., Heidelbergensis is known to be at least 500,000 years old). You’ll note that cases (1), (2), and (3) together imply that A is always the ancestor of either B or C, or both. My initial mistake was to simply set B to an archaic genome, and assert that since A cannot credibly be the ancestor of B, it must be the case that A is the ancestor of C. Note that because A cannot credibly be the ancestor of B, Cases (1) and (3) are eliminated, leaving Case (2), which makes perfect sense: B is archaic, and is the ancestor of A, who is in turn the ancestor of C. However, this is not credible if C is also archaic, producing a lot of bad data.
Updated Ancestry Algorithm
The updated algorithm first tests literally every genome in the dataset, and asks whether it is at least a 60% match to an archaic genome, and if so, it treats that genome as archaic for purposes of the test, so that we avoid the problem highlighted above. This will allow us to reasonably assert that all tests involve exactly one archaic genome B, and therefore, we must be in Case (2). Interestingly, some archaic populations were certainly heterogenous, which is something I discussed previously. As a result, there are three ostensibly archaic genomes, that do not match to any other archaic genomes in the dataset, and they are therefore, not treated as archaic, despite their archeological classification. You can fuss with this, but it’s just three genomes out of 664, and a total of 19,972,464 comparisons. So it’s possible it moved the needle in marginal cases, but the overall conclusions reached in [2] are plainly correct, given the data this new ancestry test produced.
There is however the problem that the dataset contains only Heidelbergensis, Denisovan, and Neanderthal genomes, leaving out e.g., Homo Erectus, and potentially other unknown archaic humans. There’s nothing we can do about this, since we’re constantly finding new archaic humans. For example, Denisovans were discovered in 2010, which is pretty recent, compared to Heidelbergensis, which was discovered in 1908. Moreover, the three genomes in question are possibly three new species, since they don’t match to Denisovan, Heidelbergensis, or Neanderthals. All of that said, taken as a whole, the results produced by this new algorithm, which makes perfect theoretical sense and must be true, are consistent with the results presented in [2]. Specifically, that humans began in Africa, somewhere around present day Cameroon, migrated to the Middle East, then Asia, producing the two most evolved maternal lines that I’ve identified, somewhere around Nepal. Those two maternal lines are both found around the world, and descend from Denisovans and Heidelbergensis, respectively, suggesting that many modern humans are a mix between the most evolved maternal lines that originated in two distinct archaic human populations, effectively creating hybrids. For your reference, you can search for the Pre Roman Ancient Egyptian genome (row 320, which descends from Heidelbergensis) and the Icelandic genome (row 464, which descends from Denisovans).
The Distribution of Archaic mtDNA
When I first started studying mtDNA, I quickly realized that a lot of modern humans have archaic mtDNA. See [1] for details. This is not surprising, since mtDNA is so stable, and inherited directly from a mother to its offspring, and modern humans carry at times significant quantities of archaic DNA generally. That said, 53.01% of the genomes in the dataset test as archaic, meaning that the genome is a few hundred thousand years old, without that much change. I’ve seen studies that say some humans contain around 7% to 10% archaic DNA (on the high end). This is not exactly the same statement, since those types of studies say that around 7% to 10% of someone’s DNA could be archaic. In contrast, my work suggests that a significant majority of living human beings contain outright archaic mtDNA.
That said, I’m using whole-genome sequencing, with a single global alignment, which maximizes the differences between genomes. See [2] for more details. So it’s possible that as techniques improve, studies in other areas of the human genome will produce results similar to mine, since most researchers are (as far as I know) still focusing on genes, which are a tiny portion of the whole genome. Generally speaking, my work shows that focusing on genes is probably a mistake, that was driven by necessity since genomes are huge, and computers were slow. See [1] for empirical results that demonstrate the superiority of whole-genome analysis. I did all of this on a Mac Mini, and it runs in about 3 hours, and requires comparing all triplets of genomes, drawn from a dataset of 664 genomes (i.e., rows), where each genome has 16,579 bases (i.e., columns). This works out to
calculations, all done on a consumer device. I’d wager professional computers can now start to tackle much larger genomes using similar techniques. As a result, I think we’re going to find that a lot of people contain a lot of truly archaic DNA generally. Any argument to the contrary is sort of strange, because if people stopped selecting archaic female mates, then archaic mtDNA should have vanished, and it obviously didn’t, leading to the conclusion, that the rest of the genome likely does contain archaic DNA generally.
Below I’ve set out a list of populations ordered according to the percentage of genomes within that population that test as archaic, starting at 0% archaic, and increasing up to 100% archaic, i.e., in increasing order. The test performed was to ask, for each population, what percentage of the genomes in that population are at least a 60% match to at least one archaic genome. Again, 53.01% of the full dataset tested as archaic, and as you’ll see below, several modern populations consist of only archaic mtDNA (i.e., 100% of the genomes are a 60% match to at least one archaic genome). One immediate takeaway, is that the classical world seems to have absolutely no archaic mtDNA. I’ve also noted that the Ancient Roman maternal line seems to have been annihilated, which was almost certainly deliberate.
Finally, I’ll note that preliminary results suggest that the Ancient Roman maternal line (which again, no longer exists, anywhere in the world) seems to be the most evolved maternal line in the entire dataset.
The code is attached to the bottom of the post.
0% Archaic
Ancient Egyptian
Ancient Roman
Basque
Phoenecian
Saqqaq
Thai
Igbo
Icelandic
Hawaiin
Dublin
Sri Lanka
10% to 49% archaic
Polish
Sardinian
Tanzania
Korean
German
Swedish
Scottish
Nepalese
Japanese
Sami
Filipino
Dutch
Spanish
Sephardic
Belarus
Norwegian
Egyptian
Finnish
Pashtun
Ukrainian
Irish
French
Portuguese
Danish
English
50% to 99% Archaic
Chinese
Maritime Archaic
Georgian
Munda
Nigerian
Ashkenazi
Mongolian
Hungarian
Ancient Finnish
Greek
Russian
Mexican
Vedda Abor.
Italian
Turkish
Chachapoya
Khoisan
Neanderthal
Uyghur
Kenyan
Indian
Saudi
Kazakh
Denisovan
Mayan
Taiwanese
100% Archaic
Iberian Roma
Heidelbergensis
Papau New Guinea
Ancient Bulgarian
Ancient Chinese
Sol. Islands
Indonesian
Andamanese
Iranian
Ancient Khoisan
Javanese
Jarkhand
Cameroon