Religion, Caste, and Genetics

I found an article a while back claiming that Western people went to India [1] and placed themselves at the top of the Hindu caste system. This might have happened, but I’m now of the view that the reason for the genetic overlap between some Europeans and Africans, on the one hand, and Asians generally, on the other, is a migration back to the West, in particular to Scandinavia and Nigeria. First, it is accepted that the Roma are closely related to the Dalit caste of India, based upon genetics. Further, it is obvious that some people in Scandinavia and Nigeria are related to Indians; specifically, both are related to the Munda people, who are in turn not related to the Roma at all. The logical conclusion is that the Munda people are not of the Dalit class, and that some Western people, including Africans, are related to non-Dalit Indians. The fact that some Africans plainly descend from non-Dalit Indians (again with basically no relationship to the Dalit) places doubt on the claim that Europeans invented the Hindu caste system, and is instead consistent with the claim that the Hindu caste system is ancient, and was carried back to Europe and Africa during a much earlier migration back to the West. Finally, when we look at Buddhist countries, where there is no caste system, we plainly see a closer relationship to the Roma, in particular in Mongolia, and to a lesser extent in Thailand. And again, this is notable, because the Munda have basically no genetic connection to the Roma at all, suggesting the caste system was strictly enforced, and as a consequence, some Europeans and Africans also have basically no genetic connection to the Roma. Therefore, it is at least consistent with the facts that Buddhism led to a change in the genetics of parts of Asia, presumably on account of the absence of a caste system, creating a more genetically heterogeneous society that included people of Dalit descent.

Returning to the hypothesis in [1], it is therefore of course possible that some Europeans and Africans simply descend from ancient, non-Dalit Indians, rather than the other way around. Moreover, Europeans and Africans generally do have a meaningful connection to the Roma, even in Scandinavia, suggesting again that the caste system did not originate in the West. There are however, as noted, exceptions, in particular, the Icelandic and the Igbo, who have, again, no noticeable genetic relationship to the Roma. This is at least consistent with the hypothesis that both people descend from an ancient, proto-Hindu society, and by that, I mean a society that actively enforced a caste system, excluding genetically Dalit people, even if they didn’t have a written religion where the Dalit were effectively cut-off from reproduction with others.

We see this also in Jharkhand, India, in Java, in the Solomon Islands, and to a lesser extent in Indonesia, where again we find people with basically no genetic relationship to the Dalit. It is of course possible that Hinduism proper is responsible for this in Java, but it makes no sense at all to assume that Hinduism is responsible for the absence of a genetic relationship between the people of the Solomon Islands and the Dalit. It makes more sense to instead assume that Hinduism memorialized ancient, existing, ethnic mating practices in Asia and the Pacific, and that Buddhists consciously abandoned these practices, thereby changing demographics in at least Mongolia and Thailand. Interestingly, there’s at least some evidence that something similar was happening in the Maritime Archaic, where people that are closely related to Jews only mated with each other, even though the genomes in question almost certainly predate Judaism, and in any case, it is not credible to claim that there were practicing Jews in Canada before Christ. The net point is that as religion, and written systems generally, developed, they memorialized existing practices, including which populations were perceived as acceptable for marriage and mating generally. This hypothesis would therefore view at least some early religions as codifying potentially ancient behaviors that predate written language altogether.

Ancient Finnish mtDNA

I read an article last night claiming that a set of ancient Finnish remains from the Iron Age is related to the Sami people. I don’t disagree, but they’re much closer to the Russians than the Sami, and in general, these are plainly Roma people, who are in turn related to Heidelbergensis (just like the Russians). I’m not going to completely dismiss the results of a peer-reviewed article in Nature, though at the same time, my work is incomparably more precise than typical genetic analysis. See Section 7.1 of A New Model of Computational Genomics [1]. As such, I’m going to assume that they are related to the Sami (which is consistent with my work), and that the modern day Sami are a mix between these ancient people and others that do not descend from Heidelbergensis, which would produce the match distribution on the left below, for the modern Sami, that shows a mix of Roma and non-Roma populations. So on net, I would say that this ancient Finnish population mixed over time with people that are more closely related to the modern day Sami, specifically the Saqqaq, eventually producing the modern genetic distribution of the Sami people.

Below is the updated dataset that now includes 10 of these ancient Finnish genomes. All of the code you need to run these examples is in [1].

https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0

Selection and the Vanishing of Traits

I think I just figured out why human beings lost basically all of their hair (versus primates), and the answer is, we stopped selecting for it. That alone shouldn’t matter, but if you add in a hypothesis that more or less constant mutation happens, on some level, then traits that are not actively selected for, will eventually vanish. This is basically an entropy of genetics, that would require constant effort, or environmental pressure, to maintain the traits of a species. In the case of body hair for humans, we stopped selecting for it because we developed the ability to use animal pelts, and as a consequence, both the environment, and possibly the individuals in question, stopped selecting for body hair, and presumably started selecting for other things.

Given that people still have hair on their heads, and to some extent on their bodies, it must have some utility, even if it’s just aesthetic, though this doesn’t undermine the more general thesis, that traits simply vanish, if not selected for, which is superficially impossible to argue with, for the simple reason that mutation is real, and as a result, all traits will be subject to what is basically erosion. If that erosion is significant, the trait in question could dwindle and vanish.
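The erosion argument above can be made concrete with a small worked model. This is only a sketch under stated assumptions that are not in the original text: each base mutates independently with probability `mu` per generation, a mutation replaces the base with one of the other three at random, and there is no selection at all. The function name `retained_fraction` is hypothetical. Under those assumptions the expected fraction of bases still equal to the original after `t` generations is 1/4 + (3/4)(1 - 4·mu/3)^t, which decays toward 1/4, i.e., toward pure chance.

```python
def retained_fraction(mu, generations):
    """Expected fraction of bases still equal to the original sequence.

    Assumes independent sites, a per-generation mutation probability `mu`,
    mutations that pick one of the three other bases uniformly, and no
    selection. These are illustrative assumptions, not measured values.
    """
    return 0.25 + 0.75 * (1.0 - 4.0 * mu / 3.0) ** generations


# With no mutation nothing erodes; with mutation the trait drifts to chance (0.25).
for t in (0, 1_000, 100_000, 10_000_000):
    print(t, retained_fraction(1e-6, t))
```

Selection, in this picture, is whatever force resets the fraction back toward 1 each generation; without it the decay is the default.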

A New Model of Computational Genomics

I’ve updated my formal paper on genetics, A New Model of Computational Genomics, which now includes more theory, and experimental data regarding imputation. The most important improvement is with regard to the discussion of the predictive power of the software, which allows ethnicity to be predicted with about 80% accuracy. In contrast, simulating a haplogroup, by identifying all bases common to a population (which would therefore include all genes common to that population), and using that to predict, had no predictive power at all, producing an accuracy consistent with chance. The bare minimum interpretation is that haplogroups are not precise enough to predict ethnicity at the level of a nationality. It’s also possible that they simply lack predictive power, which would at least call contemporary genetics into question. This doesn’t mean that they’re insignificant; it could, however, imply that they lack the significance necessary to make accurate and narrow predictions given a genome of unknown provenance.

Enjoy!

Charles

Predicting Ethnicity and Haplogroups

I noted in my paper, A New Model of Computational Genomics [1], that imputation using sequential bases is categorically inferior to using random bases, in several experiments testing the extent of imputation. That is, if you select K sequential bases (e.g., a particular gene), and attempt to predict the remainder of the genome using only that sequence, you underperform when compared to selecting K random bases in the genome. Because genes are sequential within a genome, it suggests that analyzing genes, and therefore haplogroups, might not be the best way to predict ethnicity, and therefore ancestry. This seems to be the case empirically.
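To make the comparison concrete, here is a minimal sketch of the two ways of choosing K bases. It is not the code from [1]. The imputation step is assumed, for illustration only, to be a nearest-neighbor lookup: find the dataset genome that agrees most with the query on the K observed positions, and use it to fill in the rest. All function names are hypothetical, and genomes are assumed to be aligned strings of equal length.

```python
import random


def sequential_indices(k, start=0):
    """K consecutive positions, e.g. a single gene-like window."""
    return list(range(start, start + k))


def random_indices(length, k, seed=0):
    """K distinct positions drawn uniformly from the whole genome."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(length), k))


def impute(query, dataset, indices):
    """Return the dataset genome that agrees most with `query` on `indices`."""
    return max(dataset, key=lambda g: sum(1 for i in indices if g[i] == query[i]))


def imputation_accuracy(query, dataset, indices):
    """Fraction of the unobserved positions that the imputed genome gets right."""
    observed = set(indices)
    hidden = [i for i in range(len(query)) if i not in observed]
    predicted = impute(query, dataset, indices)
    return sum(1 for i in hidden if predicted[i] == query[i]) / len(hidden)


# Toy data: the first four bases are identical in every genome, so a
# sequential window over them carries no information about the rest.
dataset = ["AAAACCCC", "AAAAGGGG"]
query = "AAAAGGGT"
print(imputation_accuracy(query, dataset, sequential_indices(4)))  # 0.0
print(imputation_accuracy(query, dataset, [0, 2, 4, 6]))           # 0.75
print(imputation_accuracy(query, dataset, random_indices(8, 4)))
```

The toy data is rigged to show one mechanism by which a contiguous window can underperform: if the window falls in a conserved region, it cannot distinguish genomes that differ elsewhere, whereas scattered positions sample the whole genome.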

Specifically, the attached code generates a set of bases in common for every population in a dataset of human mtDNA genomes. For example, the algorithm finds all bases that are common to Chinese individuals, and stores that as what is in essence a reference genome for the Chinese population. If a gene is common to all Chinese individuals, then it must be included in this reference genome, since the reference genome contains all bases common to the Chinese, and therefore, all genes common to the Chinese, in addition to any other bases they share as a population. All of the genomes are complete genomes taken from the NIH Database, and include provenance files with links to the NIH Database.
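As an illustration of the construction described above (a sketch, not the linked MATLAB file), the following builds a reference genome from the bases common to every genome in a population. It assumes the genomes are aligned strings of equal length. The function name `common_bases` and the `-` placeholder for positions that are not shared are choices made here for illustration.

```python
def common_bases(genomes):
    """Keep each base shared by every genome; mark all other positions '-'."""
    if not genomes:
        raise ValueError("need at least one genome")
    length = len(genomes[0])
    if any(len(g) != length for g in genomes):
        raise ValueError("genomes must be aligned to the same length")
    reference = []
    for column in zip(*genomes):  # one column per genome position
        first = column[0]
        reference.append(first if all(b == first for b in column) else "-")
    return "".join(reference)


# Toy population of three aligned sequences (not real data).
population = ["ACGTAC", "ACGTTC", "ACGAAC"]
reference = common_bases(population)
print(reference)  # ACG--C
```

Any gene that is identical across the whole population survives intact in the reference, since every one of its positions is shared.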

The next step is to predict the ethnicity of an individual using those reference genomes. Specifically, the algorithm takes a given testing genome, and finds the reference genome to which it is most similar. This process has an accuracy of approximately 1.2%. There are 56 ethnicities in the dataset, and therefore this process performs about as well as chance, which is 1/56 ≈ 1.8%. The total runtime is a few minutes.
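The prediction step can be sketched as follows. This is an illustrative sketch, not the linked code: similarity is assumed to be the raw count of positions where the testing genome agrees with a defined reference base, references use `-` for positions that are not shared across the population, and the labels and sequences are toy values. The function names are hypothetical.

```python
def match_count(genome, reference):
    """Count positions where the genome agrees with a defined reference base."""
    return sum(1 for g, r in zip(genome, reference) if r != "-" and g == r)


def predict_ethnicity(genome, references):
    """Return the label whose reference genome shares the most bases with `genome`."""
    return max(references, key=lambda label: match_count(genome, references[label]))


# Toy reference genomes keyed by population label (not real data).
references = {"CN": "ACG--C", "NO": "A-GT-A"}
print(predict_ethnicity("ACGAAC", references))  # CN

# Chance baseline for a dataset with 56 ethnicities.
chance = 1 / 56
print(f"{chance:.1%}")  # 1.8%
```

One design point worth noting: a raw match count favors references with more defined positions, so dividing by the number of defined positions is a plausible alternative. Which convention the linked code uses is not stated here.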

 

Haplogroups are plainly not precise, which you can see in the map above, that shows haplogroups crossing national boundaries. Moreover, discovering haplogroups requires a lot of work. In contrast, the software in [1] is capable of predicting ethnicity at the national level, with no human analysis ex ante, with an accuracy of about 80%. For example, the algorithms in [1] can discern between Swedes and Norwegians, whereas the haplogroups shown above plainly cannot, and instead Swedes and Norwegians are grouped together, though both are distinguished from Finns. Moreover, the attached code casts serious doubt on using genes and haplogroups for analyzing ancestry, since they’re apparently incapable of predicting ethnicity, which should be easier. That is, ancestry posits something in addition to ethnicity, which is that one ethnicity is the ancestor of another, and therefore, ancestry should be more difficult to predict than ethnicity alone.

My opinion is that these results suggest circular reasoning in the construction of haplogroups, where national, geographic, and language groups are used to define populations, and then common genes are identified, rather than allowing the genomes themselves to define groups of people, without reference to anything exogenous to the genomes. Moreover, this software shows that common genes do not allow you to predict ethnicity. In contrast, the software in [1] learns from a dataset of stated ethnicities, and is then able to predict the ethnicity of other genomes, without any human analysis at all. And again, the software in [1] is plainly more precise than haplogroups, in any case. Therefore, taken as a whole, [1] appears to present a superior method of analyzing ethnicity and ancestry, which is to use whole-genomes, treat the stated national / linguistic ethnicities as bona fide, and allow software to identify any relevant features. Moreover, the software in [1] also allows for the construction of populations that are based solely upon the genomes themselves, thereby allowing for the mechanistic, and therefore objective, construction of genetic groups, independent of national, geographic, and language groups.

Here’s the code and the dataset, and any missing code is linked to in [1]:

https://www.dropbox.com/s/6x8796m9hi9h934/Uniform_Bases_Prediction_CMDNLINE.m?dl=0

https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0

Javanese mtDNA

Again mostly due to chance, I found a Javanese genome (modern) in the NIH Database, and it is notable because at even 30% of the genome, there is no match to Heidelbergensis. This is not true for many of the populations in the dataset, which at this point contains 58 global ethnicities. The logical conclusion, is that the Javanese people are an isolated, modern population, that are closely related to very early humans, and no one else, save for the Neanderthals and Denisovans. This is really interesting, because e.g., the Norwegians, who are plainly geographically isolated, are related to basically everyone at 30% of their genome, which you can see below. In contrast, this Javanese genome produces a very thin distribution at even 30%, which is only 5% above chance. All of the code can be found in my paper, A New Model of Computational Genomics, and the dataset is linked to below.

https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0

Ancient Khoisan mtDNA

I’m working on something completely different related to ancient mtDNA, and I happened to find an ancient Khoisan genome in the NIH database. I also noticed earlier today, again working on something different, that both the Nigerians and Kenyans seem to have a relationship to the Denisovans. I already knew that the Kenyans were related to Denisovans, whereas I had never noticed any connection between the Nigerians and Denisovans. This prompted me to ask whether they had at least something more than chance in common with Denisovans, and the answer is yes. Specifically, the Nigerians start to match with Denisovans at about 30% of their genome. This is 5% above chance, and as a consequence, it is not possible that it is the result of chance. See A New Model of Computational Genomics [1], specifically footnote 16, which goes through the math.
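For a rough sense of scale, here is one way to bound the probability of a 30% match arising by chance. It is not necessarily the argument in footnote 16 of [1]. It assumes that each position matches independently with probability 1/4 (four equally likely bases), and it takes the genome length to be 16,569 bases, the length of the standard human mtDNA reference sequence. Under those assumptions, Hoeffding's inequality bounds the probability that the match fraction exceeds chance by `excess` at exp(-2·N·excess²).

```python
import math


def hoeffding_tail_bound(num_bases, excess):
    """Upper bound on P(match fraction >= chance + excess), independent positions."""
    return math.exp(-2.0 * num_bases * excess ** 2)


n = 16569                # assumed genome length (human mtDNA reference)
excess = 0.30 - 0.25     # observed match fraction minus chance
bound = hoeffding_tail_bound(n, excess)
print(f"{bound:.2e}")    # on the order of 1e-36
```

The independence assumption is the weak point, since real genomes share long conserved stretches. The bound is best read as showing how quickly chance becomes negligible at this genome length, not as an exact probability.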

There are two possibilities: one is that the Nigerians had a fleeting relationship with Denisovans, which caused only subtle changes to their mtDNA (see Section 5 of [1]). The other possibility is that they have an ancient, and possibly archaic connection to Denisovans. There is an ongoing search for so-called “Southern Denisovans”, since Denisovan fossils are typically found in Asia, not Africa. If Denisovans are actually from Asia, then we should not find ancient Denisovans in Africa. As it turns out, this particular genome is closely related to both Denisovans and Neanderthals, and is much closer to Denisovans than Neanderthals. You’ll also note that this genome is related to the Nigerians, again suggesting, an ancient connection between the Denisovans and Nigerians. Though this is not an archaic genome, since it’s only about 3,000 years old, it is ancient, and therefore consistent with the hypothesis that all hominins, i.e., Denisovans, Homo Sapiens, Neanderthals, and Heidelbergensis, all come from Africa. Below is the normalized match count for the Ancient Khoisan genome, at 50% of the genome. All of the code you need to run this analysis is in [1], and the dataset can be found here.

Ukrainian mtDNA

I hypothesized that many Ukrainians would be related to the Vikings, because of my admittedly loose understanding of the history, and it seems that I was correct. Specifically, the Ukrainians appear to be a mix of both Russian (not surprising) and Scandinavian heritage. What is surprising, is that they are also closely related to the Pashtuns of Afghanistan and Pakistan, who were also subjected to genocide by the Russians. This might be a coincidence, but I doubt it at this point, and I suspect instead, that this group of people (which includes many Jews, both Sephardic and Ashkenazi) has been the target of genocide for at least a century at this point, and that many Communist states deliberately exterminated exactly this bloodline of people. The chart below shows the distribution of ethnicities that are a 99% match to the Ukrainians.

 

Note that it must be the case that mtDNA contains information about paternal lineage, since my software can predict ethnicity, using mtDNA alone, with an accuracy of about 80%. This would be impossible if mtDNA did not contain information about paternal lineage, and I’ve shown experimentally that the mtDNA of two populations does converge to a single, new set of genomes, almost certainly due to paternal selection. Further, note that PT stands for Pashtun, IL stands for Icelandic, and UK stands for Ukrainian (EN stands for English). The complete list of acronyms can be found at the end of my paper, A New Model of Computational Genomics [1].

Here’s the updated dataset, and any code required to generate the chart above can be found in [1].

https://www.dropbox.com/s/re0ww4yisdstx5z/NN%20Population_CMDNLINE.m?dl=0

On the Origins of Humanity

There’s apparently some debate about whether humans come from Africa or from Asia, and after not reviewing much of the literature (being honest), and instead conducting my own research in genetics, I’ve concluded that we all come from Africa, that many of us migrated to Asia, possibly Central Asia, and that some of us then migrated from Asia back to Africa, Northern Europe, and the Pacific. See A New Model of Computational Genomics, generally. Specifically, it looks like some Scandinavians, Thai, Japanese, Khoisan, and Nigerian people are all very closely related to each other, to the point of 90% plus matches on the maternal line. I’ve shown that mtDNA must carry information about the paternal line as well, since my software can predict ethnicity with about 80% accuracy. As a consequence, it follows that some Scandinavians, Thai, Japanese, Khoisan, and Nigerian people are all very closely related to each other, as a general matter. This is not to the exclusion of other people; it’s just most obvious in these populations. Therefore, I am of the belief that humanity began in Africa, a belief that is in my opinion based in archeology, and not genetics. Specifically, archeological evidence of early humans is most prevalent in Africa. Below is a plot from Wikipedia that shows the global distribution of tools associated with archaic humans from about one million years ago to about one hundred thousand years ago.

 

In contrast, the migration-back hypothesis is, in my opinion, rooted in genetics. Specifically, you find simply inexplicable connections between global populations, in particular certain Northern Europeans, Africans, and Asians. These relationships make no sense in the context of known history, and instead make perfect sense in the context of genetics and common sense. Why did the early Egyptians appear to be Asian? Why do the Khoisan to this day appear to be Asian? Why are Stave Churches plainly reminiscent of Thai temples? One simple solution is that all of these people are part of a single group of people that migrated back to the West from Asia. On the left is a Norwegian Stave Church, to its right is a Thai Temple, after that is Menkaure and Queen Khamerernebty II (c. 2,530 BCE), courtesy of MFA Boston, after that Nefertiti (c. 1,370 BCE), courtesy of Wikipedia, and on the bottom right is Cleopatra (c. 50 BCE), courtesy of Wikipedia, who plainly looks nothing like the rest of them.

Convergence of mtDNA

I’ve noted before that mtDNA must provide information about the paternal line, since I’ve written software that can predict ethnicity with about 80% accuracy, without any filtering for confidence, using mtDNA alone. See, A New Model of Computational Genomics [1], generally. Because ethnicity is a combination of both paternal and maternal ethnicity, there’s just no argument to the contrary – the accuracy would otherwise be horrible. I’ve developed reasonable hypotheses to explain this, specifically, the selection of particular maternal lines is probably a decent explanation for the fact that mtDNA must carry information about paternal ethnicity. That is, males in a given geography prefer particular females, for whatever reason, and that produces a unique distribution of maternal lines, which in turn, identifies the paternal lines.

However, some of my results suggest more direct influence from the paternal line. Specifically, it seems at least plausible that males select females that have mtDNA bases in common with them, which would over many generations cause the two maternal lines to fuse into one. For example, a Norwegian individual, when selecting among mates in Sweden, will select the mate that has the maximum number of mtDNA bases in common. This behavior would, over time, cause both Norwegian and Swedish mtDNA to combine, since each generation would mate on the basis of the maximum number of bases in common. This is of course a random example, but I saw some evidence of this in the Danes, who seemed to be a mix between Swedes and Norwegians.

I’ve developed an experiment and software to test this hypothesis. Specifically, some populations are mixes between modern and archaic humans, and I’ve tested whether the introduction of archaic mtDNA impacts the modern mtDNA of the population in question. The experiment I’ve come up with is to test which Mongolians are at least a 60% match to Denisovans. There are 19 complete Mongolian genomes in the dataset, 8 Denisovan genomes, and 1 Heidelbergensis genome. All genomes are complete mtDNA genomes taken from the NIH Database, complete with provenance files for each genome linking to the genome descriptions. This gives each of the Mongolian genomes 8 chances to match with a Denisovan, and if a single match occurs, it is included in a list of genomes that are treated as, in essence, Denisovan. Of the 19 Mongolian genomes, 4 were a match to at least 1 Denisovan. This leaves 15 genomes that did not match. The question is then, do the remaining 15 genomes have more in common with the Denisovans than a population that has no clear relationship to the Denisovans?

This is superficially impossible, because mtDNA is inherited directly from the mother to the child, typically with no mutations at all. However, my hypothesis is that males select females on the basis of genetic similarity. Specifically, that males attempt to maximize the number of bases in common with their female mate. This will, after generations, cause the mtDNA of the paternal line to converge with the mtDNA of the maternal line. Specifically for this experiment, it should be the case that the non-Denisovan Mongolian genomes have more bases in common with Denisovans than some other population that has no clear relationship to Denisovans. As a reference population with no clear relationship to either Denisovan or Heidelbergensis, I selected the English, and there are 9 English genomes in the dataset. The results suggest that I’m correct, since the average match count between a non-Denisovan Mongolian genome and the Denisovans is 4,957.9 bases, whereas the average match count between the English and the Denisovans is 4,673.2 bases. Applying the same methods to Heidelbergensis, we have 5,003.6 matching bases for the non-Heidelbergensis Mongolians, and 4,767.4 bases for the English. The same is true of the Ashkenazi Jews, Kenyans, and Finns, all of whom have a similarly close relationship to the Denisovans. All of this is plainly consistent with the hypothesis that selection can alter mtDNA, specifically, selection by the paternal line.
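The structure of the experiment can be sketched in a few lines. This is an illustration on toy sequences, not the linked MATLAB code. It assumes aligned genomes of equal length, takes a "60% match" to mean that at least 60% of positions are identical, and averages match counts over every genome-to-archaic pair; the exact averaging used in the original experiment is not stated. All names are hypothetical.

```python
def match_count(a, b):
    """Number of aligned positions where two genomes carry the same base."""
    return sum(1 for x, y in zip(a, b) if x == y)


def split_by_archaic_match(population, archaic, threshold=0.60):
    """Split genomes by whether they match at least one archaic genome."""
    matched, unmatched = [], []
    for genome in population:
        best = max(match_count(genome, ref) for ref in archaic)
        if best >= threshold * len(genome):
            matched.append(genome)
        else:
            unmatched.append(genome)
    return matched, unmatched


def average_match_count(population, archaic):
    """Mean match count over every (genome, archaic genome) pair."""
    counts = [match_count(g, ref) for g in population for ref in archaic]
    return sum(counts) / len(counts)


# Toy stand-ins for the archaic, mixed, and reference populations.
archaic = ["AAAAAAAAAA"]
mixed = ["AAAAAAAAAC", "AAAAACCCCC", "AAAACCCCCC"]
control = ["AACCCCCCCC", "AAACCCCCCC"]

matched, unmatched = split_by_archaic_match(mixed, archaic)
print(len(matched), len(unmatched))             # 1 2
print(average_match_count(unmatched, archaic))  # 4.5
print(average_match_count(control, archaic))    # 2.5
```

The comparison of interest is the last two numbers: the genomes that fall below the threshold are compared against a control population, and the hypothesis predicts the former average will be higher.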

Attached is the code and the dataset. Any missing code can be found in [1].

https://www.dropbox.com/s/e648xvrn1rls5rw/Mutation_Affinity_CMNDLINE.m?dl=0

https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0