Computing Ancestry

In Section 6 of my paper, A New Model of Computational Genomics, I presented an algorithm that builds a graph showing possible ancestral connections among genomes. The basic idea is that given genomes A, B, and C, if genome A is the ancestor of both genomes B and C, then it is almost certainly the case (as a matter of probability) that genomes A and B, and genomes A and C, will each have more in common than genomes B and C do. This is for the simple reason that it is far more likely that B and C will each mutate away from A, divergently, than it is that B and C will somehow spontaneously develop common bases.
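To make the intuition concrete, here is a minimal Python simulation. It is only a sketch, not the code from the paper: the function names, the 10% mutation rate, and the genome length are all assumptions chosen for illustration. It mutates one random ancestor genome twice, independently, and compares the three pairwise match fractions.

```python
import random

BASES = "ACGT"

def match_fraction(a, b):
    """Fraction of aligned positions at which two equal-length genomes agree."""
    if len(a) != len(b):
        raise ValueError("genomes must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mutate(genome, rate, rng):
    """Return a copy of genome where each base is replaced, with probability
    `rate`, by one of the three other bases (hypothetical mutation model)."""
    out = []
    for base in genome:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(10_000))  # genome A
child_b = mutate(ancestor, 0.10, rng)  # genome B
child_c = mutate(ancestor, 0.10, rng)  # genome C

ab = match_fraction(ancestor, child_b)
ac = match_fraction(ancestor, child_c)
bc = match_fraction(child_b, child_c)
print(f"A-B: {ab:.3f}  A-C: {ac:.3f}  B-C: {bc:.3f}")
```

Under this toy model, B and C each keep roughly 90% of A, while B and C agree with each other only where both left a base untouched, or where both happened to mutate to the same base, which is rare.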

For exactly the same reason, if two genomes A and B have more than 25% of their bases in common (i.e., more than the match expected by chance, given four possible bases), but less than 100% of their bases in common, then they almost certainly have an ancestral connection. Specifically, there are exactly three possibilities: (i) genomes A and B have a common ancestor; (ii) genome A is the ancestor of genome B; (iii) genome B is the ancestor of genome A. You can’t say which is the case, but the point is, there must be an ancestral relationship, as a consequence of basic probability. This becomes more compelling as the percentage increases above 25% and decreases below 100%, and in both cases it quickly becomes basically impossible to argue with.
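How quickly the chance explanation collapses can be checked with a short calculation. The sketch below assumes the simplest possible null model, in which every base of two unrelated genomes is drawn independently and uniformly from four options, so each position matches with probability 0.25. That model and the function name `chance_tail` are my assumptions for illustration; Section 6 of the paper may formalize this differently.

```python
from math import comb

def chance_tail(n, k, p=0.25):
    """Probability that two unrelated genomes of length n match on at least
    k positions, if each position matches independently with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Even for a genome of only 100 bases, the tail probability drops off
# sharply as the required match percentage rises above 25%.
for percent in (30, 40, 50, 70):
    print(f"at least {percent}% match by chance: {chance_tail(100, percent):.3e}")
```

Real genomes are far longer than 100 bases, which makes the drop-off even steeper under the same assumptions.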

As such, the attached code sets a window within which two genomes are treated as a match, with the minimum match set to 70%, and the maximum match set to 96%. I came up with these numbers because a significant portion of the global population is a 70% match with Denisovan mtDNA, and a large portion of the global population is a 96% match with Heidelbergensis, suggesting that if an ancestral relationship exists over even an enormous amount of time (i.e., hundreds of thousands of years), you shouldn’t be much further off than that.
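As a rough illustration of how such a window turns into a graph, here is a small Python sketch. It is not the attached code: the names are hypothetical, the toy genomes are made up, and treating both bounds as inclusive is an assumption.

```python
from itertools import combinations

MIN_MATCH = 0.70  # lower bound of the window, from the text
MAX_MATCH = 0.96  # upper bound of the window, from the text

def match_fraction(a, b):
    """Fraction of aligned positions at which two equal-length genomes agree."""
    if len(a) != len(b):
        raise ValueError("genomes must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def in_window(a, b, lo=MIN_MATCH, hi=MAX_MATCH):
    """True if the match fraction lies inside the window (inclusive bounds
    are an assumption made for this sketch)."""
    return lo <= match_fraction(a, b) <= hi

def ancestry_graph(genomes):
    """Build an undirected graph as a dict of neighbor sets: two genomes are
    connected when their match fraction falls inside the window."""
    graph = {name: set() for name in genomes}
    for (n1, g1), (n2, g2) in combinations(genomes.items(), 2):
        if in_window(g1, g2):
            graph[n1].add(n2)
            graph[n2].add(n1)
    return graph

# Made-up toy genomes, ten bases each.
toy = {
    "g1": "ACGTACGTAC",
    "g2": "ACGTACGTAA",  # 90% match with g1: inside the window
    "g3": "ACGTACGTAC",  # identical to g1: above the window, so no edge
    "g4": "TTTTTTTTTT",  # 20% match with g1: below the window
}
graph = ancestry_graph(toy)
print(graph)
```

Note that an edge here says only that an ancestral relationship is plausible; it does not say which of the three possibilities above applies.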

Specifically, 100% of both the Iberian Roma and Papuans (i.e., from Papua New Guinea) in the dataset below are a 96% match with Heidelbergensis. As a consequence, they must be truly ancient people, since Heidelbergensis is believed to have gone extinct hundreds of thousands of years ago. They must be a mutation off of Heidelbergensis, or even more interestingly, possibly predate or have a common ancestor with Heidelbergensis. Therefore, in every case, the Romani and Papuans must be hundreds of thousands of years old. It simply must be true, or we’re wrong about when Heidelbergensis went extinct.

That follows from basic probability (again see Section 6 of the paper above), but what’s really interesting, using the algorithm below, is that it seems a lot of people have an ancestral connection to the Phoenicians, including the Scandinavians. This is something I hypothesized a long time ago, because both were ship-building peoples that lived in city-states, and because some ancient Runes (i.e., the Viking alphabet) appear to be Semitic. They also seem to have gods in common, specifically Adon and Odin, and their sons Baal and Baldr, Canaanite and Norse, respectively. Here’s the distribution of potential ancestral relationships for the Norwegian genomes in the dataset, and you’ll note the plain connection to the Phoenicians (acronym PH), who are in turn also closely related to the Sardinians (acronym SR). Note that the dataset has been diligenced to ensure that, e.g., a Norwegian genome is collected from an ethnically Norwegian person, as opposed to a person merely located in Norway. All genomes are taken from the NIH Database, and the dataset is therefore courtesy of the NIH.

The height of each column shows the percentage of the maximum possible number of matching genomes for each population.
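One way to read that normalization is sketched below in Python. This is my interpretation, not the attached code: it assumes the "maximum possible number" for a population is the number of query genomes times the number of genomes in that population, and it reuses the 70% to 96% window. The function name, the toy genomes, and the label "XX" are hypothetical; "PH" is the Phoenician acronym used above.

```python
def match_fraction(a, b):
    """Fraction of aligned positions at which two equal-length genomes agree."""
    if len(a) != len(b):
        raise ValueError("genomes must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def column_heights(query_genomes, populations, lo=0.70, hi=0.96):
    """For each population, return the number of (query, member) pairs whose
    match fraction lies in [lo, hi], as a percentage of all possible pairs."""
    heights = {}
    for label, members in populations.items():
        possible = len(query_genomes) * len(members)
        hits = sum(
            1
            for q in query_genomes
            for m in members
            if lo <= match_fraction(q, m) <= hi
        )
        heights[label] = 100.0 * hits / possible if possible else 0.0
    return heights

# Made-up toy data: two query genomes and two small populations.
query = ["ACGTACGTAC", "ACGTACGTAA"]
populations = {
    "PH": ["ACGTACGTAG", "ACGTACGTCC"],  # every pair matches at 80% or 90%
    "XX": ["TTTTTTTTTT", "ACGTACGTAC"],  # only one of the four pairs is in the window
}
heights = column_heights(query, populations)
print(heights)
```

With this reading, a column at 100% means every query genome falls in the window against every genome of that population.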

Here’s the command line code. Any subroutines can be found in A New Model of Computational Genomics, together with the dataset itself:

https://www.dropbox.com/s/w1m2j5lsvj232ku/Ancestral_Connections_CMDNLINE.m?dl=0

