Paternal Lineage and mtDNA

In my paper, A New Model of Computational Genomics [1], I introduced an embedding from mtDNA genomes to Euclidean space, that allows for the prediction of ethnicity, using mtDNA alone (see Section 5). The raw accuracy is 79.00%, without any filtering for confidence, over a dataset of 36 global ethnicities. Chance implies an accuracy of about 3.0%. Because ethnicity is obviously a combination of both maternal and paternal DNA, it must be the case that mtDNA carries information about the paternal lineage of an individual. Exactly how this happens is not clear from this result alone, but you cannot argue with the result overall, which plainly implies that mtDNA, which is inherited only from the mother, carries information about paternal lineage generally. This does not mean you can say who your father is using mtDNA, but it does mean that you can predict your ethnicity with very good accuracy, using only your mtDNA, which in turn implies your paternal ethnicity.

One possible hypothesis is that paternal lines actually do impact mtDNA, through the process of DNA replication. Specifically, even though mtDNA is inherited directly from your mother, it still has to be replicated in your body, trillions of times. As a consequence, the genetic machinery, which I believe to be inherited from both parents, could produce common mutations due to paternal lineage. I can’t know that this is true, but it’s not absurd, as something has to explain these results.

Finally, note that this also implies that the clusters generated using mtDNA alone are therefore also indicative of paternal lineage. (See Section 4 of [1]). To test this, I wrote an analogous algorithm, that uses clusters to predict ethnicity. Specifically, the algorithm begins by building a cluster for each genome in the dataset, which includes only those genomes that are a 99% match to the genome in question (i.e., counting matching bases and dividing by genome size). The algorithm then builds a distribution of ethnicities for each such cluster (e.g., the cluster for row 1 includes 5 German genomes and 3 Italian genomes). Because there are now 411 genomes, and 44 ethnicities, in the updated dataset, this corresponds to a matrix that has 411 rows and 44 columns, each of which contains an integer, that indicates the number of genomes from a given population included in the applicable cluster. I then did exactly what I described in Section 4 of [1], which is to compare each distribution to every other, by population, ultimately building a dataset of profiles (the particulars are in [1]). The accuracy is 77.62%, which is about the same as using the actual genomes themselves. This shows that the clusters associated with a given genome contain information about the ethnicity of an individual, and therefore, the paternal lineage of that individual.

All of this implies that many people have deeply mixed heritage, in particular Northern Europeans, ironically touted as a “pure race” by racist people that apparently didn’t study very much of anything, including their languages, which on their own, suggest a mixed heritage. One sensible hypothesis is that the clusters themselves are indicative of the distribution of both maternal and paternal lines in a population. You can’t know this is the case, but it’s consistent with this evidence, and if it is the case, racist people are basically a joke. Moreover, I’m not aware of any accepted history that explains this diversity, and my personal guess (based upon basically intuition and not much else), is that there was a very early period of sea-faring, global people, prior to written history.

Attached is the code for the analogous clustering algorithm. The rest of the code and the dataset is linked to in [1].

https://www.dropbox.com/s/bmybgpxl1s5e1aq/Cluster-Based_Profile_CMNDLINE.m?dl=0


Discover more from Information Overload

Subscribe to get the latest posts sent to your email.

Leave a comment