mtDNA Alignment

In a previous note, I pointed out that many (and possibly nearly all) human mtDNA genomes “begin” (despite their circularity) with exactly the same 15 bases:

GATCACAGGTCTATC

In fact, because the genome is circular, it makes perfect sense that there is a starting index; otherwise, you run the risk of beginning protein production at different indexes, given the same genome, thereby producing different proteins. Other species seem to have their own opening ledes as well. A very small number of genomes in the NIH database do not open with this lede, but this is extremely rare in what I’m assuming is an enormous database: though I haven’t done any formal analysis, using BLAST I’ve found only about a dozen entries that do not contain exactly this sequence in the opening of their ideal alignment. That is, these dozen or so genomes still contain the sequence, but not in the opening of the alignment that maximizes the number of matching bases. Further, some Japanese genomes contain minor deletions from this opening sequence, and therefore require minor adjustments to this alignment. In contrast, some of the genomes I found using BLAST require significant adjustments, effectively deleting around 570 bases from the genome, suggesting a significant deviation from a typical human mtDNA genome.

This suggests that, as a general matter, the correct empirical alignment for the human mtDNA genome begins with this sequence, despite the fact that the genome is circular. That gives a useful and arguably “correct” starting index, and this is in fact reflected in the NIH database, with basically all human mtDNA genomes I’ve found aligned with this opening sequence (including the roughly 200 complete genomes assembled in the dataset below).

In that same note, I pointed out that if you use this alignment, which the NIH plainly does as a general matter, you find that matching genomes converge to a match, and genomes that don’t match diverge. Specifically, if you calculate the percentage of matching bases from index 1 to index K, and increase K, you find that two genomes that in fact have a high number of matching bases produce a curve that plainly converges to around 99% to 100%. In contrast, two genomes that don’t match instead diverge from a high matching percentage to around 25% (i.e., chance). This produces curves that are useful for Machine Learning, since it implies an unsupervised clustering algorithm, where two genomes are clustered together if they produce an upward sloping curve, and otherwise are not clustered together. The plot above shows 10 Nigerian mtDNA genomes compared to a single Japanese genome. The x-axis is the genome index, and the y-axis is the percentage of matching bases, from index 1 up to the x-value. Most of the Nigerian genomes plainly do not match, and so they diverge, whereas some plainly do (converging at the top). There’s also an outlier in the middle, which you can consider as a third class that is a partial match, or simply disregard; the bottom line is that this produces a useful clustering algorithm.
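
Here’s a minimal sketch of the curve described above, written in Python rather than the MATLAB of the attached code. The function name and the toy sequences are my own placeholders, not taken from the dataset, and it assumes the two genomes are already aligned at index 1.

```python
from itertools import accumulate


def cumulative_match_percent(a, b):
    """Percentage of matching bases from index 1 up to each index K."""
    n = min(len(a), len(b))
    # Running count of positions where the two sequences agree.
    running = accumulate(1 if a[i] == b[i] else 0 for i in range(n))
    return [100.0 * m / (k + 1) for k, m in enumerate(running)]


# Toy sequences (hypothetical, not real genomes).
reference = "GATCACAGGTCTATCACCCT"
close = "GATCACAGGTCTATCACCCA"  # differs only at the last base

curve = cumulative_match_percent(reference, close)
# curve[0] is 100.0 and curve[-1] is 95.0 (19 of 20 bases match).
```

Plotting `curve` against its index gives one line of the kind of plot described above.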

As I tested this more, I realized that you can also build a distribution of the indexes of unequal bases. Specifically, for each genome, run Nearest Neighbor, and find the indexes where that genome and its Nearest Neighbor differ. This will create a distribution, where each index is associated with some number of genome pairs that disagree at that index. The higher that number, the greater the number of instances of unequal bases at that index. When you plot this, you find exactly the same pattern, where the number of anomalously high counts of unequal bases tapers off, which is consistent with the curves plotted above, which converge as a function of index. The bar chart above shows all peaks that exceed the average plus one standard deviation, though you can of course make use of variations on this. Note the peak at the end is due to the fact that basically all of the genomes are missing that entry, causing all of them to be treated as unequal at that index (i.e., the algorithm first calculates equal bases, and disregards all missing entries, causing the complement to include all instances of missing entries). Considering this further, it implies that regions common to genomes are found in between the peaks of the chart above. This should make it easy to find genes, and more generally, sequences common to populations, which presumably have some function, even if it’s simply indicative of genetic grouping.
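
The steps above can be sketched as follows, again in Python rather than the MATLAB of the attached code. The helper names and the three toy genomes are hypothetical. It assumes equal-length, already-aligned sequences, does not handle missing entries, and uses 0-based indexes.

```python
from statistics import mean, pstdev


def count_matches(a, b):
    """Number of positions where two aligned sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def nearest_neighbor(i, genomes):
    """Index of the genome sharing the most bases with genomes[i]."""
    others = (j for j in range(len(genomes)) if j != i)
    return max(others, key=lambda j: count_matches(genomes[i], genomes[j]))


def mismatch_histogram(genomes):
    """For each index, count genomes that differ from their nearest neighbor there."""
    length = min(len(g) for g in genomes)
    hist = [0] * length
    for i, g in enumerate(genomes):
        nn = genomes[nearest_neighbor(i, genomes)]
        for k in range(length):
            if g[k] != nn[k]:
                hist[k] += 1
    return hist


def peaks(hist):
    """Indexes whose count exceeds the average plus one standard deviation."""
    threshold = mean(hist) + pstdev(hist)
    return [k for k, h in enumerate(hist) if h > threshold]


# Toy genomes (hypothetical): they disagree only at index 1.
genomes = ["GATCACAG", "GTTCACAG", "GCTCACAG"]
hist = mismatch_histogram(genomes)
peak_indexes = peaks(hist)
```

On this toy data the only peak is at index 1, since that is the only position where any genome differs from its nearest neighbor.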

Specifically, the code attached produces 984 roughly homogeneous regions. We can then calculate the length of each such region. The total sequence length of the homogeneous regions is 15,592 bases, and the total length of the genome is 16,576 bases. This leaves 984 bases unaccounted for, as highly variable regions. This is approximately the length of the D-Loop, also known as the non-coding region, which apparently is a “hot spot for mtDNA alterations”. As a general matter, mtDNA is dense with genes, suggesting that we shouldn’t have too many inconsistent regions, and in fact we don’t. Moreover, you can plainly see that the inconsistent regions become more sparse, with a highly inconsistent and contiguous region in the beginning that could very well be the D-loop.
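
As a rough illustration of this partition, here’s a Python sketch (not the attached MATLAB) that takes a list of peak indexes and returns the stretches in between. The function name and the toy peak list are hypothetical, and it treats every non-peak index as part of a homogeneous region, which is my reading of the description above.

```python
def homogeneous_regions(peak_indexes, genome_length):
    """Maximal runs of consecutive indexes containing no peak, as (start, end) pairs."""
    peak_set = set(peak_indexes)
    regions, start = [], None
    for k in range(genome_length):
        if k in peak_set:
            if start is not None:
                regions.append((start, k - 1))  # close the current run
                start = None
        elif start is None:
            start = k  # open a new run
    if start is not None:
        regions.append((start, genome_length - 1))
    return regions


# Toy example (hypothetical): 12 indexes, 5 of them flagged as peaks.
regions = homogeneous_regions([0, 1, 2, 5, 9], 12)
lengths = [end - start + 1 for start, end in regions]
variable_bases = 12 - sum(lengths)
```

Here the region lengths sum to 7, leaving 5 bases as variable, in the same way that 16,576 less 15,592 leaves 984.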

This is therefore an unsupervised algorithm that apparently correctly partitions this genome.

Here’s the code and the dataset:

https://www.dropbox.com/s/0y881tw2s7w91c8/Temp_CMDNLINE_12_18.m?dl=0

https://www.dropbox.com/s/4m6fhz77ki2rtg8/Genetic_Nearest_Neighbor_Fast.m?dl=0

https://www.dropbox.com/s/casfm3i07v0vefl/Count_Matching_Bases.m?dl=0

Dataset:

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

The Structure of mtDNA

There’s a plain structure to mtDNA, and astonishingly, every genome I’ve seen so far has exactly the same opening sequence of 15 characters. Some Asian peoples have deletions, but their genomes are otherwise exactly the same.

It is literally the exact same opening sequence, globally, and it is as follows:

GATCACAGGTCTATC

This got me thinking that there’s an order to human mtDNA, that variation starts to take place after this opening, as a function of index. It seems that this is in fact the case. Even more interesting, when two genomes match beyond mere chance, they produce a convergence towards the overall percentage of matching bases. That is, if you start at index 1, and read to the end of the genome, if two genomes match beyond chance, then the percentage of matching bases from 1 to the end starts to increase at a certain point. If instead, the two genomes have a match that is close to chance (i.e., roughly 1/4 of the bases match), then the percentage of matching bases decreases as a function of index. Here’s a plot of 10 Nigerian mtDNA genomes compared to a single Japanese genome. The x-axis is the genome index, and the y-axis is the percentage of matching bases, from index 1 up to the x-value.

This implies a clustering algorithm, where if the slope is negative on average, then it’s not a match. If instead the slope is positive on average, then there is a match.
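
Here’s one way the slope rule could be sketched in Python (the original code is MATLAB). The choice to measure the slope only from a `start` index onward is my assumption, since every curve begins at 100% over the shared opening sequence. The function names and toy sequences are hypothetical.

```python
from itertools import accumulate


def cumulative_match_percent(a, b):
    """Percentage of matching bases from index 1 up to each index K."""
    n = min(len(a), len(b))
    running = accumulate(1 if a[i] == b[i] else 0 for i in range(n))
    return [100.0 * m / (k + 1) for k, m in enumerate(running)]


def average_slope(curve, start):
    """Mean change per index of the curve from position `start` to the end."""
    return (curve[-1] - curve[start]) / (len(curve) - 1 - start)


def is_match(a, b, start):
    """Cluster two genomes together if the curve slopes upward on average."""
    return average_slope(cumulative_match_percent(a, b), start) > 0


# Toy sequences (hypothetical, not real genomes).
reference = "GATCACAGGTCTATCACCCT"
recovers = "GATCGTAGGTCTATCACCCT"  # two early mismatches, then agreement
diverges = "GATCACAGGTAGCGATGGAC"  # agreement early, mismatches afterwards

match_flags = (is_match(reference, recovers, 10), is_match(reference, diverges, 10))
```

The first pair dips early and then climbs, so it is treated as a match; the second pair falls steadily over the second half, so it is not.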

Most of the Nigerian genomes are plainly not matches. However, there are two that are a 98% and 100% match to Japanese genomes, respectively (at the top). This implies unquestionable common maternal lineage. There’s a third, that you can see that seems to lag, and then catch up, which has a match percentage of 77%. This obviously implies a bit of judgment, but the algorithm makes perfect sense, and you can deal with these types of issues as you like.

The first and obvious takeaway is that political race is bull shit, and our history is questionable. The scientific takeaway is that mtDNA does seem to follow a chronology, from the first index to the last, and if this is true, then it seems there was an explosion of diversity in maternal lines early in our history, later leading to a convergence, more or less on par with modern maternal lines.

Here’s the code; anything missing (including datasets) can be found in the post just below this one:

https://www.dropbox.com/s/blzcyi7eyuqxu3a/Code%20%281%29.zip?dl=0

Maritime Archaic mtDNA

The Maritime Archaic mtDNA genomes available from the NIH Database are a 99% match (i.e., 99% of the bases match) for several European, African, and Asian people. Note these are complete genomes, and so it is impossible to deny common ancestry.

This is astonishing, because the strongest match is the Spanish, and this provides irrefutable evidence that people of European descent reached the Americas thousands of years before Columbus, as the Maritime Archaic people are dated somewhere between 7,500 and 3,500 years before present. You’ll also note that the Maritime Archaic samples are a 95% match to Homo Heidelbergensis, which is consistent with the hypothesis that Heidelbergensis is an ancestor common to all of humanity. Each chart shows the number of people in a given population that had a 99% match with the population in question. So the first chart shows the number of people from each population that had a 99% match to at least one Maritime Archaic genome. There are 10 rows in each population, over 17 populations, except for Heidelbergensis, for which only one complete genome is available.
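
For illustration, the counts behind each chart could be computed along these lines. This is a Python sketch with hypothetical names and toy data, not the attached MATLAB, and it assumes all genomes are already aligned at index 1.

```python
def match_fraction(a, b):
    """Fraction of aligned positions where two sequences agree."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n


def count_matching_people(populations, targets, threshold=0.99):
    """Per population, count genomes that match at least one target genome."""
    counts = {}
    for name, genomes in populations.items():
        counts[name] = sum(
            any(match_fraction(g, t) >= threshold for t in targets)
            for g in genomes
        )
    return counts


# Toy data (hypothetical): short strings stand in for complete genomes,
# so a lower threshold is used than the 99% applied to real genomes.
targets = ["GATCACAGGT"]
populations = {
    "A": ["GATCACAGGT", "GATCACAGGA"],
    "B": ["TTTTTTTTTT"],
}
counts = count_matching_people(populations, targets, threshold=0.8)
```

On the toy data, both genomes in population "A" clear the threshold and the one in "B" does not.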

The raw data together with the code and files providing provenance (i.e., direct links to the NIH Database for each row) are all available in two separate zip files:

https://www.dropbox.com/s/e0zf5eokcfdmi7s/MATLAB%20CODE.zip?dl=0

https://www.dropbox.com/s/lxq8gfb4h0p8edw/mtDNA.zip?dl=0

I suppose mtDNA doesn’t control much for superficial appearance, given these results, and the others I’ve been sharing lately, certainly not the factors we typically associate with race, but more remarkably, because it doesn’t change much over generations, it forces us to recognize the brevity of our considerations, that they’re informed by a few centuries or millennia, when the real history of humanity spans hundreds of thousands of years, possibly longer.

Using a simple measure of information, specifically N × H, where N is the size of a distribution, and H is the entropy of the distribution, the Danes are the most diverse people in the world, with a 99% match to a simply astonishing variety of nationalities. Even more astonishing, if you lower the match threshold to about 95%, you’ll see that many modern populations are a match for Homo Heidelbergensis, an archaic human that was thought to have gone extinct hundreds of thousands of years ago, though it’s quite clear many modern humans are basically indistinguishable on their maternal line from this otherwise archaic species.
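
A small Python sketch of this measure follows. It rests on two assumptions of mine that the text doesn’t settle: N is taken to be the total number of observations in the distribution, and H is the Shannon entropy in bits. The function name is hypothetical.

```python
from math import log2


def information(counts):
    """N times H for a distribution given as a list of counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    probs = [c / n for c in counts if c > 0]
    h = -sum(p * log2(p) for p in probs)  # Shannon entropy in bits
    return n * h


# Matches spread evenly over two nationalities versus all in one.
spread = information([2, 2])
concentrated = information([4])
```

With the same number of matches, the evenly spread distribution scores 4.0 and the concentrated one scores zero, so the measure rewards both volume and variety.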

I suppose mtDNA doesn’t control much in the way of superficial appearance, given these results, and the others I’ve been sharing lately, certainly not the factors we typically associate with race, but more remarkably, because it doesn’t change much over generations, it forces us to recognize the brevity of our considerations, that they’re informed by a few centuries or millennia, when the real history of humanity spans hundreds of thousands of years, possibly longer. It’s not that this line of study doesn’t divide humanity, as it certainly does, but not along any political or racial basis I’ve seen before. Instead, more than anything else, it shows that our ideas of race are totally unscientific, and basically a myth.

Update on Japanese mtDNA

It turns out the Japanese do have unique mtDNA, but the alignment data provided by the NIH hides this, because it presents the first base of the genome as the first index, without any qualification, even though there’s an obvious deletion in the opening sequence of bases. Maybe this is standard, but it’s certainly confusing, and it completely wrecks small datasets, where you might not have another sequence with the same deletion. The NIH of course does, and that’s why BLAST returns perfect matches for genomes that contain deletions, while my software didn’t, because I only have 185 genomes.

The underlying paper that the genomes are related to is here:

https://pubmed.ncbi.nlm.nih.gov/34121089/

Again, there’s a blatant deletion in many Japanese mtDNA genomes, right in the opening sequence. This opening sequence is perfectly common to all other populations I sampled, meaning that the Japanese really do have a unique mtDNA genome.

Here’s the opening sequence that’s common globally, right in the opening 15 bases:

GATCACAGGTCTATC

For reference, here’s a Japanese genome with an obvious deletion in the first 15 bases, followed by an English genome for comparison:

https://www.ncbi.nlm.nih.gov/nuccore/LC597333.1?report=fasta

https://www.ncbi.nlm.nih.gov/nuccore/MK049278.1?report=fasta

Once you account for this by simply shifting the genome, you get perfectly reasonable match counts, around the total size of the mtDNA genome, just like every other population. That said, the deletion is unique to the Japanese, as far as I know, and that’s quite interesting, especially because they have great health outcomes as far as I’m aware, suggesting that the deletion doesn’t matter, even though the intact sequence is common to literally everyone else (as far as I can tell). Again, literally every other population (using 185 complete genomes) has a perfectly identical opening sequence that is 15 bases long, which is far too long to be the product of chance.
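
To make the shifting idea concrete, here’s a Python sketch (the linked software is MATLAB, and this is not a transcription of it). It tries a handful of small shifts and keeps the one that maximizes matching bases. The function names, the gap character, and the toy sequences are hypothetical, and it only models a deletion at the very start of the genome.

```python
def count_matches(a, b):
    """Number of positions where two aligned sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def best_shift(reference, genome, max_shift=5):
    """Pad the start of `genome` with 0..max_shift gaps; return the best shift."""
    scores = {
        s: count_matches(reference, "-" * s + genome)
        for s in range(max_shift + 1)
    }
    return max(scores, key=scores.get)


# Toy example (hypothetical): the second sequence is missing its first two bases.
reference = "GATCACAGGTCTATCACCCT"
deleted = reference[2:]

shift = best_shift(reference, deleted)
matches_before = count_matches(reference, deleted)
matches_after = count_matches(reference, "-" * shift + deleted)
```

Without the shift, the two sequences agree at only a few positions; after shifting by two, all 18 remaining bases line up.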

Here’s the updated software that finds the correct alignment accounting for the deletion:

https://www.dropbox.com/s/2lwgtjbzdariiik/Japanese_Delim_CMDNLINE.m?dl=0

Japanese mtDNA

I noticed that some Japanese people seem to have a very low number of bases in common with not only the world, but each other. The dataset I’m using consists of 185 complete genomes, from 19 nationalities, and 3 ancient species, all taken from the NIH Database.

For 2 of the 10 Japanese complete genomes, the maximum number of matching bases against any other genome in the dataset is about 5,000. The complete genome has a size of 16,579 bases, and so this is not much better than chance, given by 16,579/4 ≈ 4,145, suggesting that it really is just the operation of chance causing any intersection at all between those Japanese genomes and the global population generally.
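
The chance baseline can be checked with a few lines of Python. This sketch assumes the four bases are equally likely and independent at every index, which is a simplification, and the function names are my own.

```python
import random


def expected_chance_matches(length, alphabet_size=4):
    """Expected matching positions between two uniformly random sequences."""
    return length / alphabet_size


def simulated_chance_matches(length, seed=0):
    """Count matches between two random sequences over A, C, G, T."""
    rng = random.Random(seed)
    a = rng.choices("ACGT", k=length)
    b = rng.choices("ACGT", k=length)
    return sum(x == y for x, y in zip(a, b))


expected = expected_chance_matches(16579)  # 4144.75
simulated = simulated_chance_matches(16579)
```

The simulated count lands close to the expected 4,144.75, so a best match of about 5,000 bases is above pure chance, but not by much compared with a typical match of nearly the whole genome.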

This view finds further support in the fact that the entire global population has a perfectly consistent genome (i.e., no variation at all) over the first 15 bases. The probability of this being chance is 1/4^190, which is so small, it’s zero in MATLAB. That is, the sequence has a length of 15, and it is common to 175 genomes.

Note this dataset includes 3 complete ancient genomes, specifically, Denisovan, Maritime Archaic, and Homo heidelbergensis, all of which also contain exactly the same globally common sequence. Homo heidelbergensis is thought to have gone extinct hundreds of thousands of years ago, suggesting there is basically zero variation in the opening prefix to human mtDNA.

Said otherwise, globally, there is no mutation at all over the first 15 bases of the human mtDNA genome, anywhere in known history.

This is not true when you include Japan, and in fact, only 1 genome out of 10 is a perfect match, and therefore consistent with the global genome. Instead, the average number of matches excluding that one individual, is 3.2, over the opening prefix of 15 bases.

Putting it all together, you have a global match count for 2 out of 10 Japanese people that seems to be the result of pure chance, and 9 out of 10 Japanese people have a prefix segment that is almost entirely inconsistent with a globally and historically uniform segment of mtDNA.

Has anyone noticed this before or heard other people discussing it? I think it’s consistent with one of three hypotheses:

  1. Japanese mtDNA has a much higher rate of mutation than typical mtDNA, for whatever reason. We could test for this by looking at the rate of change from one generation to the next.
  2. Japanese mtDNA descends from a totally different bacterium.
  3. There was an event that caused a drastic mutation to Japanese mtDNA, and then natural selection took over, and so nothing much changed, since as far as I know, the Japanese have no drastically higher rates of diseases connected to mtDNA, and in fact they have good health outcomes overall.

If either 1 or 3 is true, then it suggests that DNA could have an error correcting function, since single base variants often produce disease, yet here we have drastically inconsistent mtDNA that doesn’t seem to have any notable problems at all. Note that natural selection would certainly kill off bad outcomes, but it doesn’t produce good outcomes. And so this particular case is at least consistent with the idea that DNA can adjust mutated sequences to avoid malfunction and disease.

In any case, this is highly unusual, since mtDNA is consistent for generations, and in some cases over possibly hundreds of thousands of years. I’ll add the caveat that it could be bad data, despite being from a reputable source, and the opening prefix being inconsistent is perhaps evidence of this.

Here’s the dataset with a ton of code you can use to analyze the data, and here’s the search string for the raw data from the NIH Database.