Another Note on Alignment

I noted yesterday that not only is local alignment intractable (if you treat every base index as a potential insertion / deletion), it also doesn’t seem to matter much compared to global alignment. I tested this experimentally on the mtDNA dataset I’ve put together, and it turns out that in the best case, you add an average of 41.42 matching bases by accounting for local alignment. Specifically, I ran Nearest Neighbor on every genome, and then accounted for all sequences of unequal bases between a given genome and its nearest neighbor that were at least 10 bases long. This minimum length is necessary, because otherwise you end up shifting bases locally in sequences that are short, and that could therefore be the result of chance.

In some cases local alignment does add a significant number of bases, and you can see that in the chart below, which shows the maximum percentage increase that could be achieved using local alignments over sequences at least 10 bases long. However, you can also see that this is extremely rare, and the increase is typically around 0 bases. The x-axis shows the genome index in the dataset, and the y-axis shows the ratio of (a) the extra matching bases due to local alignment to (b) the original matching base count, which peaks around 15% of the genome size. That is, it shows the percentage increase in matching bases due to accounting for local alignment. Moreover, this is the absolute maximum, since it assumes that a single shift in a sequence of M bases will produce M-1 matching bases, which is obviously not guaranteed.

The plain takeaway is that unless you’re looking for genes, or looking for insertions and deletions, local alignment is probably not important. Specifically, my entire model is predicated upon allowing a machine to examine an entire genome, rather than look at individual genes or regions. Ironically, this algorithm is probably useful for identifying potential insertions and deletions, for the simple reason that it identifies significantly long sequences of unequal bases.
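
To make the best-case calculation concrete, here is a minimal Python sketch of the idea. It is not the linked Calc_Max_Match.m code. The function names are hypothetical, and it assumes the two genomes are already globally aligned strings of equal length.

```python
def mismatch_runs(a, b, min_len=10):
    """Lengths of maximal runs of unequal bases that are at least min_len long."""
    runs = []
    current = 0
    for x, y in zip(a, b):
        if x != y:
            current += 1
        else:
            if current >= min_len:
                runs.append(current)
            current = 0
    if current >= min_len:  # a run that reaches the end of the genome
        runs.append(current)
    return runs


def max_local_gain(a, b, min_len=10):
    """Best-case gain: a run of M unequal bases yields at most M - 1 matches."""
    matches = sum(x == y for x, y in zip(a, b))
    extra = sum(m - 1 for m in mismatch_runs(a, b, min_len))
    ratio = extra / matches if matches else 0.0
    return matches, extra, ratio


# Toy example: 24 matching bases around one run of 12 unequal bases.
a = "ACGT" * 5 + "A" * 12 + "ACGT"
b = "ACGT" * 5 + "C" * 12 + "ACGT"
matches, extra, ratio = max_local_gain(a, b)
```

In the toy example the single run of 12 unequal bases contributes at most 11 extra matches on top of the 24 original matches. Runs shorter than min_len are ignored, which is the filter described above.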

Here’s the dataset:

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

All genomes are taken from the NIH, and each has a provenance file that links to the NIH Database.

Here’s the code:

https://www.dropbox.com/s/sz032x9py1vk3qf/Calc_Max_Match.m?dl=0

https://www.dropbox.com/s/npnz8ljzoh3sxzy/Max_Match_Count_CMNDLINE.m?dl=0

A Note on Alignment

I was working on a problem involving Heidelbergensis, and it dawned on me that local alignment and global alignment are fundamentally different problems. Specifically, if you want to find an optimum global alignment for mtDNA, you can shift the genome incrementally, and compare it to some reference genome, until you maximize the number of matching bases. If you do this locally, the arguably correct approach is to treat every base index as a potential insertion or deletion. This is intractable, despite the fact that mtDNA is finite, and so it is obviously not a sensible way to attack the problem. In fact, because you’re definitely going to get non-linear changes in match count as a function of shifting, there’s no way that a generalized optimization algorithm will solve this problem. This implies that as a general matter, global alignments are the correct way to align mtDNA.
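
As a rough illustration of the global search, here is a short Python sketch that tries every shift of a genome against a reference and keeps the one with the most matching bases. The function names are hypothetical. Treating a shift as a circular rotation is an assumption (mtDNA is a circular molecule), and the post does not say how the ends are handled in the actual code.

```python
def match_count(a, b):
    """Number of positions at which two sequences carry the same base."""
    return sum(x == y for x, y in zip(a, b))


def best_global_shift(genome, reference, max_shift=None):
    """Try each shift of genome and return (shift, matches) for the best one."""
    n = len(genome)
    if max_shift is None:
        max_shift = n - 1
    best_shift, best_matches = 0, match_count(genome, reference)
    for s in range(1, max_shift + 1):
        shifted = genome[s:] + genome[:s]  # assumption: circular rotation by s
        m = match_count(shifted, reference)
        if m > best_matches:
            best_shift, best_matches = s, m
    return best_shift, best_matches


# Toy example: the genome is the reference rotated by three bases.
reference = "ACGTTGCAAC"
genome = reference[-3:] + reference[:-3]
shift, matches = best_global_shift(genome, reference)
```

For a genome of N bases there are only N candidate shifts to compare, so the global search is a simple exhaustive loop.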

Heidelbergensis as Ancestor

I discovered an ancient Chinese genome in the NIH Database that implies that the Iberian Romani predate Heidelbergensis. The reasoning is straightforward, and impossible to argue with. Specifically, if mtDNA genome A is the ancestor of both genomes B and C, then it is almost certainly the case (as a matter of probability) that genomes A and B, and A and C, have more bases in common than genomes B and C. That is, A has more in common with both B and C, than B and C have in common with each other. This follows from basic probability, which you can read about in Section 6 of my paper, A New Model of Computational Genomics. The intuition is simple, specifically, that if you fix a set of bases in genomes B and C (i.e., those inherited from ancestor genome A), then genomes B and C are almost certainly going to diverge from that set as they mutate over time, rather than randomly develop new bases in common by chance. As a consequence, assuming they both descend from A, they should not develop even more bases in common as a function of time.

In this particular case, fixing genome A as Iberian Romani, genome B as Heidelbergensis, and genome C as the ancient Chinese genome, we find that A and B have 97% of bases in common, A and C have 65% of bases in common, and B and C have 63% of bases in common. As a consequence, the most likely arrangement is that A (the Iberian Romani) is the ancestor of both B and C. This doesn’t prove that it is the case, but it is the most likely case, since assuming Heidelbergensis is the ancestor of the other two requires assuming that the Iberian Roma and the ancient Chinese genome spontaneously developed 331 additional bases (i.e., 2% of the full genome) in common by chance, which is extremely unlikely. If the Romani in fact predate Heidelbergensis, they would almost certainly be the most ancient living humans. The fact that they are a 96% match to Heidelbergensis is alone compelling evidence for this claim.
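
The comparison above can be written as a small check. This is a sketch of the rule as described here, not the code from the paper, and the function name is hypothetical. A genome is kept as a candidate ancestor if it shares at least as many bases with each of the other two as they share with each other.

```python
def ancestor_candidates(match):
    """Genomes that share at least as much with each of the other two
    as those two share with each other.

    match maps frozenset({x, y}) to the fraction of bases x and y share.
    """
    names = sorted({name for pair in match for name in pair})
    candidates = []
    for a in names:
        b, c = [n for n in names if n != a]
        ab = match[frozenset((a, b))]
        ac = match[frozenset((a, c))]
        bc = match[frozenset((b, c))]
        if ab >= bc and ac >= bc:
            candidates.append(a)
    return candidates


# Percentages quoted in the text (A = Iberian Romani, B = Heidelbergensis,
# C = the ancient Chinese genome).
match = {
    frozenset(("A", "B")): 0.97,
    frozenset(("A", "C")): 0.65,
    frozenset(("B", "C")): 0.63,
}
candidates = ancestor_candidates(match)

# Assumption: 16,569 bases, the length of the human mtDNA reference sequence
# (rCRS); the post does not state the genome length it used.
GENOME_LENGTH = 16569
extra_shared = round((0.65 - 0.63) * GENOME_LENGTH)
```

With these numbers only A passes the check. B fails because A and C share 65% of their bases, which is more than the 63% that B shares with C. Under the assumed genome length, that 2% difference comes to 331 bases.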

Moreover, even if you account for local alignment, you end up with the Iberian Roma and Heidelbergensis equally likely to be the ancestor of the other. Specifically, assuming you’ve maximized the global alignment (i.e., shifted the genomes as a whole to maximize the percentage of matching bases), the best you can do after that is to account for local insertions and deletions. These will appear in the gaps between matching bases. It turns out, even if you make the best case assumption, which is that a shift by 1 in a gap of length M will produce M-1 matching bases, you still end up with A and B having 99.93% of bases in common, A and C having 99.83% of bases in common, and B and C having 99.83% of their bases in common. This implies that both the Iberian Roma and Heidelbergensis are equally likely to be the common ancestor genome. Note that this is arguably bad practice, because it assumes a large number of small shifts, that could of course be the result of chance. The bottom line conclusion is that the Iberian Roma are seriously ancient people. The Iberian Roma are also a 99% match for the Papuans in Papua New Guinea.

An interesting observation from this process is that if you consider only gaps of appreciable length, you barely move the match count. I’ll test this tomorrow, but it suggests that once you fix the global alignment, the local alignments that are statistically meaningful (i.e., too long to be the credible result of chance) don’t add anything material to the match count, even under the best case assumption that the entire gap would match if shifted by 1. It also suggests, again, that the Roma are more likely to be the ancestor among the three genomes, since considering only long gaps (i.e., at least 10 bases long) barely changed the match counts and didn’t change their ordinal relationships.

Here’s the dataset. All of the genomes are taken from the NIH, and have provenance files with links to the NIH Database.

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

Computing Ancestry

I presented an algorithm that builds a graph showing possible ancestral connections among genomes, which you can find in Section 6 of my paper, A New Model of Computational Genomics. The basic idea is that given genomes A, B, and C, if genome A is the ancestor genome of both genomes B and C, then it is almost certainly the case (as a matter of probability) that genomes A and B, and A and C, will have more in common with each other than genomes B and C. This is for the simple reason that it is far more likely that both genomes B and C will mutate away from genome A, divergently, than it is that both B and C will somehow spontaneously develop common bases.

For exactly the same reason, if two genomes A and B have more than 25% of their bases in common (i.e., more than you would expect by chance), but less than 100% of their bases in common, then they almost certainly have an ancestral connection. Specifically, there are exactly three possibilities: (i) genomes A and B have a common ancestor; (ii) genome A is the ancestor of genome B; (iii) genome B is the ancestor of genome A. You can’t say which is the case, but the point is, there must be an ancestral relationship, as a consequence of basic probability. This becomes more compelling as the percentage increases above 25% and decreases below 100%, and it quickly becomes basically impossible to argue with in both cases.
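
The 25% figure is the expected match rate between two unrelated sequences when each of the four bases is equally likely at every position. The following Python sketch checks this by simulation. The function names, the fixed seed, and the sequence length of 16,569 (the length of the human mtDNA reference sequence) are assumptions made for illustration.

```python
import random


def match_fraction(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))


def random_genome(n, rng):
    """Sequence of n bases drawn uniformly and independently from A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def chance_match(n=16569, seed=0):
    """Match fraction between two independent random sequences of length n."""
    rng = random.Random(seed)
    return match_fraction(random_genome(n, rng), random_genome(n, rng))


baseline = chance_match()
```

The result should land close to 0.25. Real genomes do not use the four bases equally, and if the bases occur with frequencies p_A, p_C, p_G and p_T, the chance rate is the sum of their squares, which is at least 0.25.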

As such, the attached code sets a window within which two genomes are treated as a match, with the minimum match set to 70%, and the maximum match set to 96%. I came up with these numbers because a significant portion of the global population is a 70% match with Denisovan mtDNA, and a large portion of the global population is a 96% match with Heidelbergensis, suggesting that if an ancestral relationship exists over even an enormous amount of time (i.e., hundreds of thousands of years), you shouldn’t be much further off than that.
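
Here is a minimal Python sketch of the window test, which lists every pair of genomes whose match percentage falls between the minimum and the maximum. It is not the linked code. The function names and toy sequences are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def match_fraction(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))


def ancestral_edges(genomes, lo=0.70, hi=0.96):
    """Pairs of genomes whose match fraction lies in the window [lo, hi].

    genomes maps a name to its (already aligned) sequence.
    """
    names = list(genomes)
    edges = []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            f = match_fraction(genomes[p], genomes[q])
            if lo <= f <= hi:  # assumption: both bounds are inclusive
                edges.append((p, q, f))
    return edges


# Toy sequences of 100 bases, not real genomes.
genomes = {
    "ref": "A" * 100,
    "g1": "A" * 80 + "C" * 20,  # 80% match with ref
    "g2": "A" * 99 + "C",       # 99% match with ref, above the window
    "g3": "C" * 100,            # 0% match with ref, below the window
}
edges = ancestral_edges(genomes)
```

In the toy data only ref and g1 (80%) and g1 and g2 (81%) fall inside the window, so those are the only two edges in the graph.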

Specifically, 100% of both the Iberian Roma and Papuans (i.e., from Papua New Guinea) in the dataset below are a 96% match with Heidelbergensis. As a consequence, they must be truly ancient people, since Heidelbergensis is believed to have gone extinct hundreds of thousands of years ago. They must be a mutation off of Heidelbergensis, or even more interestingly, possibly predate or have a common ancestor with Heidelbergensis. Therefore, in every case, the Romani and Papuans must be hundreds of thousands of years old. It simply must be true, or we’re wrong about when Heidelbergensis went extinct.

That follows from basic probability (again see Section 6 of the paper above). What’s really interesting, using the algorithm below, is that a lot of people seem to have an ancestral connection to the Phoenicians, including the Scandinavians. This is something I hypothesized a long time ago, because both are ship-building people that lived in city-states, and because some ancient Runes (i.e., the Viking alphabet) appear to be Semitic. They also seem to have gods in common, specifically Adon and Odin, and their sons Baal and Baldr, Canaanite and Norse, respectively. Here’s the distribution of potential ancestral relationships for the Norwegian genomes in the dataset, and you’ll note the plain connection to the Phoenicians (acronym PH), who are in turn also closely related to the Sardinians (acronym SR). Note that the dataset has been diligenced to ensure that, e.g., a Norwegian genome is collected from an ethnically Norwegian person, as opposed to a person located in Norway. All genomes are taken from the NIH Database, and the dataset is therefore courtesy of the NIH.

The height of each column shows the percentage of the maximum possible number of matching genomes for each population.
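
One way to read the column heights is sketched below in Python. This is an interpretation, not the linked code. It assumes the maximum possible number of matches for a population is the number of query genomes times the number of genomes in that population. The function name, genome labels, and counts are hypothetical.

```python
from collections import Counter


def column_heights(matched_pairs, population_sizes, n_queries):
    """Share of the maximum possible matches realised for each population.

    matched_pairs lists one (query_id, population) entry per matching genome.
    Assumption: the maximum for a population is n_queries * its size.
    """
    counts = Counter(pop for _, pop in matched_pairs)
    return {
        pop: counts[pop] / (n_queries * size)
        for pop, size in population_sizes.items()
    }


# Hypothetical counts: two Norwegian query genomes, three PH and two SR genomes.
pairs = [("NO1", "PH"), ("NO1", "PH"), ("NO2", "PH"), ("NO2", "SR")]
heights = column_heights(pairs, {"PH": 3, "SR": 2}, n_queries=2)
```

With these hypothetical counts, PH has 3 of a possible 6 matches and SR has 1 of a possible 4, giving heights of 0.5 and 0.25.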

Here’s the command line code. Any subroutines can be found in A New Model of Computational Genomics, together with the dataset itself:

https://www.dropbox.com/s/w1m2j5lsvj232ku/Ancestral_Connections_CMDNLINE.m?dl=0