Measuring the Diversity of Global Maternal Lines

Introduction

In a previous article, I showed that just 6 maternal lines are a 99% match to 67.64% of global maternal lines, using a dataset of 377 complete mtDNA genomes from 32 ethnicities. This suggests that, as a general matter, human beings are already extremely diverse, since a majority of the global population can be traced back to just a handful of maternal lines. That still leaves the question of how we should measure this diversity, and in this article, I’ll present a few methods that allow us to quantify how diverse a given population is.

Counting the Number of Maternal Lines in a Population

In the previous article, we counted the number of global maternal lines by building mutually exclusive clusters over the entire dataset of 377 genomes. This was done by first building a cluster of 99% matches for each genome in the dataset. That is, for a given genome A, if genome B has at least 99% of its bases in common with genome A, then genome B is included in the cluster for genome A. We then sort the clusters by size and, beginning with the largest, allocate all of the genomes in that cluster to it and remove them from all other clusters. We then do the same for the next largest cluster, and so on. This eventually produces mutually exclusive clusters.
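To make the allocation step concrete, here is a minimal Python sketch of the procedure described above. It is not the code linked at the end of this article (those are .m scripts). The names match_fraction and mutually_exclusive_clusters are hypothetical, and a few details are assumptions made for illustration: the genomes are already aligned to equal length, a genome counts as a match to itself, clusters are sorted once by their original size, and ties are broken by index.

```python
from typing import Dict, List, Set


def match_fraction(a: str, b: str) -> float:
    """Fraction of positions at which two aligned, equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mutually_exclusive_clusters(
    genomes: List[str], threshold: float = 0.99
) -> Dict[int, List[int]]:
    """Map each seed genome index to the genome indices allocated to its cluster."""
    n = len(genomes)
    # Step 1: one raw cluster per genome, holding every genome that matches it
    # at or above the threshold (assumption: the seed itself is included).
    raw: Dict[int, Set[int]] = {
        i: {j for j in range(n) if match_fraction(genomes[i], genomes[j]) >= threshold}
        for i in range(n)
    }
    # Step 2: visit clusters from largest to smallest (assumption: sorted once,
    # ties broken by index).
    order = sorted(raw, key=lambda i: (-len(raw[i]), i))
    # Step 3: allocate genomes to the first cluster that claims them, which
    # removes them from every later cluster.
    assigned: Set[int] = set()
    clusters: Dict[int, List[int]] = {}
    for i in order:
        members = raw[i] - assigned
        if members:  # clusters emptied by earlier allocations are dropped
            clusters[i] = sorted(members)
            assigned |= members
    return clusters


if __name__ == "__main__":
    # Toy sequences, not real mtDNA.
    toy = ["A" * 100, "A" * 99 + "C", "A" * 98 + "CC", "G" * 100]
    print(mutually_exclusive_clusters(toy))
```

Because matching at 99% is not transitive, the order of allocation matters: a genome goes to whichever of its matching clusters is processed first, so a different tie-breaking rule can produce a different partition.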

If we limit this process to a given population, we will create mutually exclusive clusters that all belong to the same population. For example, if we begin with all of the Japanese genomes and then apply this process, we will produce mutually exclusive clusters, each of which consists of genomes that are a 99% match to some given genome. As a consequence, this partitions a given population into distinct maternal lines, with each cluster containing genomes that are part of a distinct maternal line within the population in question. The table below shows the total number of genomes in each population, the number of clusters (i.e., distinct maternal lines) in each population, and the average cluster size, for each of the 32 populations. As you can see, the only truly homogeneous populations are the Kazakh, Nepalese, and Iberian Roma, whereas everyone else is fairly heterogeneous.

Ethnicity | No. Genomes | No. Clusters | Avg. Cluster Size
1. Kazakh | 30 | 6 | 5.00
2. Nepalese | 20 | 3 | 6.67
3. Iberian Roma | 19 | 1 | 19.00
4. Japanese | 20 | 10 | 2.00
5. Italian | 19 | 10 | 1.90
6. Finnish | 20 | 13 | 1.54
7. Norwegian | 20 | 9 | 2.22
8. Swedish | 20 | 8 | 2.12
9. Chinese | 20 | 12 | 1.67
10. Indian | 18 | 7 | 2.57
11. Nigerian | 9 | 6 | 1.50
12. Egyptian | 20 | 8 | 2.50
13. Russian | 6 | 3 | 2.00
14. Spanish | 13 | 9 | 1.44
15. Danish | 9 | 5 | 1.80
16. Maritime Archaic | 10 | 6 | 1.67
17. Ashkenazi Jewish | 18 | 6 | 3.00
18. Scottish | 18 | 10 | 1.80
19. Mexican | 3 | 3 | 1.00
20. Chachapoya | 10 | 6 | 1.67
21. Pre-Roman Egyptian (4,000 B.P.) | 1 | 1 | 1.00
22. Homo Heidelbergensis | 1 | 1 | 1.00
23. Mayan | 10 | 4 | 2.50
24. Khoisan | 10 | 10 | 1.00
25. English | 9 | 7 | 1.29
26. Ancient Roman | 5 | 2 | 2.50
27. Sardinian | 5 | 2 | 2.50
28. Basque | 4 | 2 | 2.00
29. Georgian | 2 | 2 | 1.00
30. German | 9 | 7 | 1.29
31. Denisovan | 1 | 1 | 1.00
32. Neanderthal | 1 | 1 | 1.00
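The three columns in the table above can be computed along the following lines. This is a hedged Python sketch rather than the linked .m code: is_match, exclusive_clusters, and population_summary are hypothetical names, the input is assumed to be a list of (ethnicity label, aligned sequence) pairs, and the average is taken to be the number of genomes divided by the number of clusters, rounded to two decimals.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def is_match(a: str, b: str, threshold: float = 0.99) -> bool:
    """True if two aligned, equal-length sequences agree on >= threshold of bases."""
    return sum(x == y for x, y in zip(a, b)) / len(a) >= threshold


def exclusive_clusters(seqs: List[str], threshold: float = 0.99) -> List[List[int]]:
    """Greedy largest-first allocation into mutually exclusive clusters."""
    raw = [{j for j, s in enumerate(seqs) if is_match(t, s, threshold)} for t in seqs]
    order = sorted(range(len(seqs)), key=lambda i: (-len(raw[i]), i))
    taken, clusters = set(), []
    for i in order:
        members = raw[i] - taken
        if members:
            clusters.append(sorted(members))
            taken |= members
    return clusters


def population_summary(
    labelled: List[Tuple[str, str]]
) -> Dict[str, Tuple[int, int, float]]:
    """For each label: (no. genomes, no. clusters, average cluster size)."""
    by_pop: Dict[str, List[str]] = defaultdict(list)
    for label, seq in labelled:
        by_pop[label].append(seq)
    summary = {}
    for label, seqs in by_pop.items():
        # Clustering is restricted to genomes from this population only.
        k = len(exclusive_clusters(seqs))
        summary[label] = (len(seqs), k, round(len(seqs) / k, 2))
    return summary


if __name__ == "__main__":
    # Toy data with made-up labels, not the real dataset.
    toy = [
        ("X", "A" * 100), ("X", "A" * 99 + "C"), ("X", "A" * 100),
        ("Y", "A" * 100), ("Y", "C" * 100),
    ]
    for label, row in population_summary(toy).items():
        print(label, row)
```

A population with a single cluster, like the toy population X here, is homogeneous in the sense used above, while a population whose cluster count equals its genome count has no two genomes that match at 99%.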

Measuring the Global Reach of a Population

We can apply a similar process for a selected population over the entire dataset. That is, we first take all of the genomes in a given selected population, and then build clusters by finding all other genomes, over all populations, that are a 99% match with a given genome from the selected population. We then build mutually exclusive clusters in exactly the same manner as above, first sorting by cluster size, and then allocating the matching genomes in size order. This allows us to find the breadth of global populations that match a given genome from the selected population, and will again partition the population, because not all genomes from the selected population will produce non-empty clusters. However, in this case, we will consider every non-empty cluster, rather than impose a minimum size. This allows us to distinguish between a population that is simply heterogeneous and one that is global.

For example, Nigerians are heterogeneous in that they have numerous maternal lines, but only one of those maternal lines is truly global, and it shows a plain connection to Northern Europeans, Norwegians and Scots in particular. It’s tempting to attribute these connections to the slave trade, but that explanation doesn’t hold up in the case of Japan and China, or, even more peculiar, Kazakhstan and the Chachapoyas. The bottom line is that an ethnically Nigerian maternal line is a basically perfect match for the ethnicities below, which does not have a simple explanation in known history (to my knowledge). In my opinion, it makes much more sense to assume that truly inexplicable cases like these are due to ancient migration patterns that are still observable today, simply because mtDNA doesn’t change much over time, and in some cases, over enormous periods of time.
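The difference from the within-population case is that only genomes from the selected population act as cluster seeds, while the candidates come from the whole dataset. A minimal Python sketch under stated assumptions follows. The names is_match and global_reach are hypothetical, the dataset is assumed to be a list of (label, aligned sequence) pairs, and each seed is counted as a member of its own cluster, which is one possible reading of the description above and may differ from the linked .m code.

```python
from typing import Dict, List, Tuple


def is_match(a: str, b: str, threshold: float = 0.99) -> bool:
    """True if two aligned, equal-length sequences agree on >= threshold of bases."""
    return sum(x == y for x, y in zip(a, b)) / len(a) >= threshold


def global_reach(
    dataset: List[Tuple[str, str]], population: str, threshold: float = 0.99
) -> Tuple[Dict[int, List[int]], float]:
    """Return ({seed index: allocated genome indices}, fraction of dataset covered)."""
    # Seeds come only from the selected population.
    seeds = [i for i, (label, _) in enumerate(dataset) if label == population]
    # Candidates come from every population (assumption: the seed itself counts).
    raw = {
        s: {j for j, (_, seq) in enumerate(dataset)
            if is_match(dataset[s][1], seq, threshold)}
        for s in seeds
    }
    # Allocate in size order, largest first; keep every non-empty cluster.
    order = sorted(seeds, key=lambda s: (-len(raw[s]), s))
    taken, clusters = set(), {}
    for s in order:
        members = raw[s] - taken
        if members:
            clusters[s] = sorted(members)
            taken |= members
    return clusters, len(taken) / len(dataset)


if __name__ == "__main__":
    # Toy data: labels reuse abbreviations from this article, sequences are made up.
    toy = [
        ("JP", "A" * 100), ("JP", "G" * 100), ("MY", "A" * 99 + "C"),
        ("NG", "A" * 100), ("KH", "T" * 100),
    ]
    clusters, coverage = global_reach(toy, "JP")
    for seed, members in clusters.items():
        print(seed, sorted({toy[j][0] for j in members}))
    print(f"coverage: {coverage:.0%}")
```

Listing the labels present in each cluster, as the usage lines do, is what separates a heterogeneous population (many clusters, each containing only its own label) from a global one (clusters that pull in many other labels).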

Applying this to the Japanese population produces 11 clusters, with a total of 210 genomes, or 56% of the dataset, which suggests that the Japanese maternal line is quite global, despite Japan’s reputation as an insular nation. It turns out that Japan is only recently insular: its isolation began in the 1600s, in response to Spanish, Portuguese, and Catholic attempts to impose colonial rule, and even to enslave Japanese people. This of course leaves open the rest of human history, which is hundreds of thousands of years old, providing plenty of opportunity for the diversity that is obviously present in literally every population, other than the Kazakh, Roma, and Nepalese people. In the case of the Japanese, you see a simply incredible scope of global populations, and below are the most interesting clusters I noticed. Among them is a Japanese genome that is a perfect match for 6 out of the 10 Ancient Mayan genomes, and nothing else, suggesting the individual in question is quite literally of Ancient Mayan heritage.

Keep in mind that the dataset has been vetted to ensure that the GenBank notes either explicitly state or plainly suggest that the person in question is of the ethnicity in question. Moreover, there’s a link for each genome to the NIH Database, where you can check the provenance yourself. So if, for example, a genome is classified as Japanese, then the GenBank notes indicate that the person is ethnically Japanese, as opposed to the genome simply having been collected from a person in Japan. Because of this, and the 99% threshold, you simply cannot argue with these results: humanity is already extremely diverse, suggesting a rich and ancient history that is arguably unknown to us, and that will probably be discoverable, at least initially, only through genetics rather than archaeology. Because mtDNA is so stable over time, it makes perfect sense as the initial point of inquiry.

Again, the only things you need to allow for expansive global trade are sailboats and telescopes, and the bottom line is that the people of Polynesia got there somehow, and they certainly didn’t use unguided rowboats. Moreover, the Ancient Romans had glass, and careful observation of optics through water would suggest that vision can be adjusted using materials, including of course glass, which was plainly known to at least one ancient civilization. Finally, below is a table that shows the number of genomes per population, together with the abbreviation used for each population in the charts above, and below that are the dataset and the command line code.

Ethnicity | Genome Count | Abbreviation
1. Kazakh | 30 | KZ
2. Nepalese | 20 | NP
3. Iberian Roma | 19 | IB
4. Japanese | 20 | JP
5. Italian | 19 | IT
6. Finnish | 20 | FN
7. Norwegian | 20 | NO
8. Swedish | 20 | SW
9. Chinese | 20 | CC
10. Indian | 18 | IN
11. Nigerian | 9 | NG
12. Egyptian | 20 | EG
13. Russian | 6 | RU
14. Spanish | 13 | SP
15. Danish | 9 | DN
16. Maritime Archaic | 10 | MA
17. Ashkenazi Jewish | 18 | JW
18. Scottish | 18 | SC
19. Mexican | 3 | MX
20. Chachapoya | 10 | CH
21. Pre-Roman Egyptian (4,000 B.P.) | 1 | PRE
22. Homo Heidelbergensis | 1 | HB
23. Mayan | 10 | MY
24. Khoisan | 10 | KH
25. English | 9 | EN
26. Ancient Roman | 5 | AR
27. Sardinian | 5 | SR
28. Basque | 4 | BQ
29. Georgian | 2 | GA
30. German | 9 | GR
31. Denisovan | 1 | DS
32. Neanderthal | 1 | NA

Here’s the dataset:

https://www.dropbox.com/s/8jlwr49fhtstpre/mtDNA.zip?dl=0

Here’s the command line code:

https://www.dropbox.com/s/4v1fo2hkt76pjws/Count%20Unique%20Population%20Genomes.m?dl=0

https://www.dropbox.com/s/hi1ggfgqnat1dwo/Mut_Exc_Clusters_By_Class.m?dl=0

