mtDNA Alignment

I’m planning on turning my work on mtDNA into a truly formal paper, that is more than just the application of Machine Learning to mtDNA, and is instead a formal piece on the history of humanity. As part of that effort, I revisited the global alignment I use (which is discussed here), attempting to put it on a truly rigorous basis. I have done exactly that. This is just a brief note, I’ll write something reasonably formal tomorrow, but the work is done.

First, there’s a theoretical question: How likely are we to find the 15 bases I use as the prefix (i.e., starting point) for the alignment, in a given mtDNA genome? You can find these 15 bases by simply looking at basically any mtDNA FASTA file in the NIH website, since they plainly use this same alignment. Just look at the first 15 bases (CTRL + F “gatcacaggt”), you’ll see them. Getting back to the probability of finding a particular sequence of 15 bases in a given mtDNA genome, the answer is not very likely, so we should be impressed that 98.34% of the 664 genomes in the dataset contain exactly the same 15 bases, and the remainder contain what is plainly the result of an insertion / deletion that altered that same sequence.

Consider first that there are 4^{15} = 1,073,741,824 sequences of bases of length 15, since there are 4 possible bases, ACGT. We want to know how likely it is that we find a given fixed sequence of length 15, anywhere in the mtDNA genome. If we find it more than once, that’s great, we’re just interested initially in the probability of finding it at least once. The only case that does not satisfy this criteria, is the case where it’s not found at all. The probability of two random 15 base sequences successfully matching at all 15 bases is \frac{1}{4^{15}}. Note that a full mtDNA genome contains N = 16579 bases. As such, we have to consider comparing 15 bases starting at any one of the N - 14 indexes available for comparison, again considering all cases where it’s found at least once as a success.

This is similar to asking the probability of tossing at least one heads with a coin over some number of N - 14 trials. However, note in this case, the probability of success and failure are unequal. Since the probability of success at a given index is given by p_s = \frac{1}{4^{15}}, the probability of failure at a given index is p_f = 1 - p_s. Therefore, the probability that we find zero matches over all N - 14 indexes is given by P_f = p_f^{N-14}, and so the probability that we find at least one match is given by 1 - P_f = 0.0000154272. That’s a pretty small probability, so we should already be impressed that we find this specific sequence of 15 bases in basically all human mtDNA genomes in the dataset.

I also tested how many instances of this sequence there are in a given genome, and the answer is either exactly 1, or 0, and never more, and as noted above, 98.34% of the 664 genomes in the dataset contain the exact sequence in full.

So that’s great, but what if these 15 bases have a special function, and that’s why they’re in basically every genome? The argument would be, sure these are special bases, but they don’t mark an alignment, they’re just in basically all genomes, at different locations, for some functional reason. We can address this question empirically, but first I’ll note that every mtDNA genome has what’s known as a D Loop, suggesting again, there’s an objective structure to mtDNA.

The empirical test is based upon the fact that mtDNA is incredibly stable, and offspring generally receive a perfect copy from their mother, with no mutations, though mutations can occur over large periods of time. As a result, the “true” global alignment for mtDNA should be able to produce basically perfect matches between genomes. Because there are 16,579 bases, there are 16,579 possible global alignments. The attached code tests all such alignments, and asks, which alignments are able to exceed 99% matches between two genomes? Of the alignments that are able to exceed 99%, 99.41% of those alignments are the default NIH alignment, suggesting that they are using the true, global alignment for mtDNA.

Code attached, more to come soon!

Discover more from Information Overload

Subscribe to get the latest posts sent to your email.

Leave a comment