Random Versus Sequential Imputation

Yesterday I presented even more evidence that you get stronger imputation using random bases in a genome, as opposed to sequential bases. I already presented some evidence for this claim in my paper A New Model of Computational Genomics (see Section 7) [1]. Specifically, I showed in [1] that when calculating the nearest neighbor of a partial genome, you are more likely to map to the true nearest neighbor if you use random bases, as opposed to sequential bases. This suggests that imputation is stronger when using a random set of bases, as opposed to a sequential set of bases. The purpose of Section 7 was simply to show that using random bases is at least workable, because the model presented there is predicated upon the assumption that you don’t need to look for genes or haplogroups to achieve imputation. As such, I didn’t really care whether or not random bases are strictly superior, though it seems that they are.

Specifically, you build clusters using a partial genome A(x), where x is some set of indexes, and another genome B is included in the cluster if A(x) = B(x). You find that the average total number of matching bases between the full genome A and all such genomes B is greater when x is a random set of indexes, versus a sequential set of indexes. I tested this for random and sequential indexes, beginning with partial genomes of 1,000 bases (i.e., x starts out with 1,000 indexes), incrementing by 2,500 bases each iteration, and terminating at the full genome size of 16,579 bases, building clusters for each of the 405 genomes in the dataset over each iteration. The random indexes are strictly superior, in that the average match count for every one of the 405 genomes, and the genomes in their respective clusters, is higher when using random indexes, versus sequential indexes. Note that the sequential indexes have a random starting point, and as such, this is not the result of an idiosyncratic portion of the genome.
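
For concreteness, here is a minimal sketch of the cluster test. This is not the attached script; it assumes, for illustration only, that the dataset has been loaded into an N x L character matrix called genomes, one aligned genome per row, with a_idx the row index of the genome A:

function avg_match = cluster_match_count(genomes, a_idx, x)
  % Builds the cluster of genomes B with B(x) == A(x), and returns the
  % average number of matching bases between the full genome A and each B.
  [N, L] = size(genomes);
  A = genomes(a_idx, :);
  cluster = [];
  for i = 1 : N
    if i ~= a_idx && all(genomes(i, x) == A(x))
      cluster(end + 1) = i;
    end
  end
  if isempty(cluster)
    avg_match = 0;
  else
    match_counts = sum(genomes(cluster, :) == repmat(A, numel(cluster), 1), 2);
    avg_match = mean(match_counts);
  end
end

Calling this with x = randperm(L, k) versus x = s : s + k - 1 (for a random starting index s) reproduces the random versus sequential comparison described above.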

This might seem surprising, since so much of genetics is predicated upon genes and haplogroups, but it makes perfect sense, since, e.g., proteins are constructed using sequences of 3 bases. As a consequence, if you concentrate the selected bases in a contiguous sequence, you’re creating overlap, since once you fix 1 base, the following 2 bases will likely be partially determined. Therefore, you maximize imputation by spreading the selected bases over the entire genome. Could there be an optimum distribution that is neither random nor sequential? Perhaps, but the point is, random is not only good enough, but better than sequential, and therefore, the model presented in [1] makes perfect sense.

Here’s the dataset and the code:

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

https://www.dropbox.com/s/9itnwc1ey92bg4o/Seq_versu_Random_Clusters_CMDNLINE.m?dl=0

Another Note on Imputation

In my most recent paper, A New Model of Computational Genomics [1], I showed that a genome is more likely to map to its true nearest neighbor if you consider a random subset of bases, versus a sequential set of bases. Specifically, let x be a vector of integers, viewed as the indexes of some genome. Let A be a genome, and let A(x) denote the bases of A, as indexed by x. That is, A(x) is the subset of the full genome A, limited to the consideration of the bases identified in x. We can then run Nearest Neighbor on A(x), which will return some genome B. If x is the full set of genome indexes, then B will be the true nearest neighbor of A.
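
As a minimal sketch of that procedure (again assuming, for illustration only, that the genomes are stored as rows of an N x L character matrix called genomes, with a_idx the row index of A):

function nn = nearest_neighbor_on_subset(genomes, a_idx, x)
  % Returns the row index of the genome with the most matching bases to A,
  % counting matches only over the indexes in x.
  N = size(genomes, 1);
  A = genomes(a_idx, :);
  best = -1;
  nn = 0;
  for i = 1 : N
    if i == a_idx
      continue;
    end
    m = sum(genomes(i, x) == A(x));
    if m > best
      best = m;
      nn = i;
    end
  end
end

If x = 1 : size(genomes, 2), this returns the true nearest neighbor; otherwise it returns the nearest neighbor as seen through the indexes in x.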

The results in Section 7 of [1] show that as you increase the size of x, you end up mapping to the true nearest neighbor more often, suggesting that imputation becomes stronger as you increase the number of known bases (i.e., the size of x). This is not surprising, and my real purpose was to prove that statistical imputation (i.e., using random indexes in x) was at least acceptable, compared to sequential imputation (i.e., using sequential indexes in x), which is closer to searching for known genes, and imputing remaining bases. It turns out random bases are actually strictly superior, which you can see below.

The number of genomes that map to their true nearest neighbor, as a function of the number of bases considered. The orange curve above is the result of a random set of indexes of a given size, and the blue curve below is the result of a sequential set of indexes of the same size.

It turns out imputation seems to be strictly superior when using random bases, as opposed to sequential bases. Specifically, I did basically the same thing again, except this time I fixed a sequential set of bases of length L, x_S, with a random starting index, and then also fixed L random bases x_R. The random starting index for x_S is to ensure I’m not repeatedly sampling an idiosyncratic portion of the genome. I then counted how many genomes contained the sequence A(x_S), and counted how many genomes contained the sequence A(x_R). If random bases generate stronger imputation, then fewer genomes should contain the sequence A(x_R). That is, if you get better imputation using random bases, then the resultant sequence should be less common, returning a smaller set of genomes that contain the sequence in question. This appears to be the case empirically, as I did this for every genome in the dataset below, which contains 405 complete mtDNA genomes from the National Institutes of Health.
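
Here is a minimal sketch of the count itself (not the attached code), under the simplifying assumption that the 405 genomes are stored as rows of a character matrix called genomes, with a_idx the row of the genome A in question:

num_bases = 16579;                             % full mtDNA genome size used above
L_sub = 1000;                                  % size of the index set
s = randi(num_bases - L_sub + 1);              % random starting index for the sequential set
x_S = s : s + L_sub - 1;                       % sequential indexes
x_R = randperm(num_bases, L_sub);              % random indexes
A = genomes(a_idx, :);                         % a_idx is assumed to be given
count_S = sum(all(genomes(:, x_S) == repmat(A(x_S), size(genomes, 1), 1), 2));
count_R = sum(all(genomes(:, x_R) == repmat(A(x_R), size(genomes, 1), 1), 2));
% Both counts include A itself; if random bases generate stronger imputation,
% count_R should generally be smaller than count_S.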

Attached is code that lets you test this for yourself. Below is a plot that shows the percentage of times sequential imputation is superior to random imputation (i.e., the number of successes divided by 405), as a function of the size of x, which starts out at 1,000 bases, increases by 2,500 bases per iteration, and terminates at the full genome size of 16,579 bases. You’ll note it quickly goes to zero.

The percentage of times sequential imputation is superior to random imputation, as a function of the number of bases considered.

This suggests that imputation is local, and that by increasing the distances between the sampled bases, you therefore increase the strength of the overall imputation, since you minimize the intersection of any information generated by nearby bases. The real test is actually counting how many bases are in common outside a given x, and testing whether random or sequential is superior, and I’ll do that tomorrow.

https://www.dropbox.com/s/9itnwc1ey92bg4o/Seq_versu_Random_Clusters_CMDNLINE.m?dl=0

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

On Perfect Knowledge

My paper Information, Knowledge, and Uncertainty [1] implies the superficially awkward conclusion that a perfectly consistent set of observations carries no Knowledge at all. This follows from the fundamental equation in [1], which assumes that Knowledge is the balance of Information less Uncertainty. Symbolically,

I = K + U,

which in turn implies that K = I - U. In the case of a single observation of a given system, Information is assumed to be given by the maximum entropy of the system given its states, and so a system with N possible states has an Information of \log(N). The Uncertainty is instead given by the entropy of the distribution of states, which could of course be less than the maximum entropy given by I = \log(N). If it turns out that U < I, then K > 0. All of this makes intuitive sense, since, e.g., a low entropy distribution carries very low Uncertainty, because it must have at least one high probability event, making the system at least somewhat predictable.
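
As a toy numerical example, measuring everything in bits (i.e., using \log base 2), and using a made-up three-state distribution:

p = [0.7, 0.2, 0.1];                        % hypothetical distribution over N = 3 states
N = numel(p);
I = log2(N);                                % Information: maximum entropy, log(N)
U = -sum(p(p > 0) .* log2(p(p > 0)));       % Uncertainty: entropy of the distribution
K = I - U;                                  % Knowledge: I - U, positive here since U < I
% For a certain event over N > 1 states, p = [1 0 0], we get U = 0 and K = I = log(N).
% For a one-state system, N = 1 gives I = 0, U = 0, and K = 0, the case discussed below.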

The strange case is a truly certain event, which implies a system with a single possible state, so that the entropy of the distribution is zero, and the maximum entropy is \log(1) = 0 as well. This in turn sets all measures to zero, implying zero Information, Knowledge, and Uncertainty. However, this makes sense if you accept Shannon’s measure of entropy, since a source with a single certain event requires zero bits to encode. Similarly, a source with a single certain event carries no Uncertainty, for exactly that reason. You could use this to argue that there’s a special case of the equation above that doesn’t really make any sense, but this is actually wrong. Specifically, you still have to have a system in the first instance; it’s just in a constant state. Such systems are physically real, albeit temporarily, e.g., a broken clock. Similarly, a source that generates only one signal still has to exist in the first instance. And as such, you have no Uncertainty with respect to something that is actually extant. In contrast, when you have no Uncertainty with respect to nothing, that’s not really notable in any meaningful or practical way. The conclusion is that zero Knowledge coupled with zero Uncertainty, with respect to a real system, is physically meaningful, because it means that you know its state with absolute certainty. You have the maximum possible Knowledge; it just happens that this quantity is zero in the case of a static system.

At the risk of being overly philosophical, if we consider the set of all mathematical theorems, which must be infinite in number for the simple reason that trivial deductions are themselves theorems, then we find a fixed set, which is immutable. As a consequence, perfect Knowledge of that set would have a measure of zero bits. To make this more intuitive, consider the set of all mathematical statements, and assign each a truth value of either true or false. If you do not know the truth value of each statement, then you are considering what is from your perspective a dynamic system, which could change as information becomes available (e.g., you prove a statement false). If instead you do know the truth value of each statement, then it is a fixed system with zero Information and zero Uncertainty, and therefore zero Knowledge.

On Significant Mutations

I think I’m finally done with my work on mtDNA, which I’ve summarized in my paper, A New Model of Computational Genomics, though I spent some time today thinking about how it is that significant mutations occur. Specifically, if you run a BLAST Search on a human mtDNA genome, and compare it to a gorilla’s, you’ll see that there’s a 577 base sequence that humans have, that gorillas do not, suggesting that humans are the result of a 577 base insertion into the mtDNA of a gorilla. You get a similar result with chimps and other similar species. Here’s a screen shot of a BLAST Search I ran comparing this human mtDNA genome to a gorilla’s, and you can see the Query genome (i.e., the human genome) begins at index 577, and the Subject genome (i.e., the gorilla genome) begins at index 1, suggesting that the human genome contains 577 bases that are simply absent from the gorilla genome.

Screen shot from the NIH website.

This isn’t necessarily the case, but the result is consistent with the assumption that a significant insertion into a gorilla’s mtDNA produced human mtDNA. This is obviously also consistent with evolution, but the question is, how could such a massive error occur in a healthy species? I think the answer is that it’s not a normal insertion, and that instead, an already assembled, yet free-floating, segment of DNA ends up attached to the end of a strand that is being assembled. That is, there’s some detritus floating around, already a formed strand of DNA, that ends up attached to one of the ends of another strand that is in the process of being assembled. This shouldn’t occur often either, but if it did, it wouldn’t imply that the genetic machinery is broken, which would almost certainly be the case given an insertion that is 577 bases long. That is, if it just happened to be the case that some leftover strand ends up attached to another strand in the midst of being assembled, then that’s just a low probability event that is not indicative of anything wrong. In contrast, if there’s an inadvertent 577 base insertion, then the genetic machinery is broken, and will almost certainly produce lethal diseases in short order.

That said, this exogenous insertion must also not be deleterious, in order for it to persist. This is of course perfectly consistent with evolution, and at the same time, consistent with a modern understanding of genetic replication, where small errors often produce disastrous and even lethal diseases. The net result would be that a healthy species just happened to experience an unlikely event that caused a piece of stray DNA to become attached to another piece during replication, and this exogenous insertion turned out to be either benign or beneficial. This would allow for significant mutations, possibly allowing for one species to mutate into another.

On Insertions and Deletions

I’ve been writing about alignment quite a bit lately, since my method of genomics makes use of a very simple alignment that follows the default NIH alignment, which you can see looking at the opening bases of basically any genome. This makes things really simple, and allows you to quickly compare entire genomes. However, I noted that in the case of my method of ancestry analysis, you actually should at least consider the possibility of local alignments, even though it doesn’t seem to matter very much.

I’m now convinced you should not consider local alignments unless you’re looking for genes, insertions, or deletions, because as I suspected, it turns out that insertions and deletions appear to define maternal lines. Moreover, insertions and deletions are associated with drastic changes in behavior and morphology, e.g., Down Syndrome and Williams Syndrome. Single-base mutations can also cause diseases, but plenty of people plainly differ by many bases over even ideal alignments, so individual base differences are not as important as indels.

Specifically, I wrote an algorithm that iterates over every possible global alignment between two genomes, and for the Iberian Roma population (a nearly perfectly homogeneous population), the alignment that maximizes the number of matching bases, when comparing two genomes from that population, is the default NIH alignment. The Iberian Roma are very closely related to the people of Papua New Guinea, and the same is true of the Papuan genomes. However, for the Kazakh and Italian populations, this is not the case, with many genomes requiring some changes to alignment, implying insertions and deletions. These insertions and deletions therefore plainly define different maternal lines within a given population, and among populations. As a consequence, again, I think the right method is to fix the global alignment using the default NIH alignment, and then compare entire genomes.
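
Here is a minimal sketch of that search. This is not the attached Count_Matching_Bases.m; it simply slides one genome past the other and counts matching bases in the overlap, assuming g1 and g2 are character vectors and max_shift is small relative to the genome length:

function [best_shift, best_count] = best_global_alignment(g1, g2, max_shift)
  % Returns the whole-genome shift of g2 relative to g1 that maximizes the
  % number of matching bases, with a shift of 0 being the default alignment.
  n0 = min(numel(g1), numel(g2));
  best_shift = 0;
  best_count = sum(g1(1 : n0) == g2(1 : n0));
  for s = 1 : max_shift
    n = min(numel(g1) - s, numel(g2));         % shift g2 to the right by s
    c = sum(g1(s + 1 : s + n) == g2(1 : n));
    if c > best_count
      best_count = c;
      best_shift = s;
    end
    n = min(numel(g1), numel(g2) - s);         % shift g2 to the left by s
    c = sum(g1(1 : n) == g2(s + 1 : s + n));
    if c > best_count
      best_count = c;
      best_shift = -s;
    end
  end
end

A best shift of 0 corresponds to the default NIH alignment, which is what you see for the Iberian Roma genomes.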

Attached is the dataset, and some code that runs through every possible global alignment.

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

https://www.dropbox.com/s/4h1myonndzkgfts/Count_Matching_Bases.m?dl=0

https://www.dropbox.com/s/ojmo0kw8a26g3n5/find_sequence_in_genome.m?dl=0

https://www.dropbox.com/s/p22as65hh9brpcv/Find_Seq_CMNDLINE.m?dl=0

Another Note on Alignment

I noted yesterday that not only is local alignment intractable (if you treat every base index as a potential insertion / deletion), it doesn’t seem to matter much compared to global alignment. I actually tested this experimentally on the mtDNA dataset I’ve put together, and it turns out that in the best case, you add an average of 41.42 matching bases by accounting for local alignment. Specifically, I ran Nearest Neighbor on every genome, and then accounted for all sequences of unequal bases between a given genome and its nearest neighbor that were at least 10 bases long. This restriction is necessary because otherwise you end up shifting bases locally in sequences that are short, and therefore could be the result of chance.

In some cases local alignment does add a significant number of bases, and you can see that in the chart below, which shows the maximum percentage increase that could be achieved using local alignments over sequences at least 10 bases long. However, you can also see that this is extremely rare, and the gain is typically around 0 bases. The x-axis shows the genome index in the dataset, and the y-axis shows the ratio of (a) the extra matching bases due to local alignment to (b) the original matching base count, a ratio that peaks at around 15%. That is, it shows the percentage increase in matching bases due to accounting for local alignment. Moreover, this is the absolute maximum, which assumes that a single shift in a sequence of M bases will produce M-1 matching bases, which is obviously not guaranteed. The plain takeaway is, unless you’re looking for genes, or looking for insertions and deletions, local alignment is probably not important. Specifically, my entire model is predicated upon allowing a machine to examine an entire genome, rather than look at individual genes or regions. Ironically, this algorithm is probably useful for identifying potential insertions and deletions, for the simple reason that it identifies significantly long sequences of unequal bases.
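
Here is a minimal sketch of the best-case calculation (not the attached Calc_Max_Match.m), assuming g1 and g2 are equal-length, globally aligned character vectors:

function extra = max_local_alignment_gain(g1, g2, min_run)
  % Best-case number of extra matching bases obtainable from local alignment,
  % counting only mismatch runs of at least min_run bases (10 in the text), and
  % assuming a single shift in a run of length M recovers M - 1 matches.
  mism = (g1 ~= g2);
  extra = 0;
  run_len = 0;
  for i = 1 : numel(mism)
    if mism(i)
      run_len = run_len + 1;
    else
      if run_len >= min_run
        extra = extra + run_len - 1;
      end
      run_len = 0;
    end
  end
  if run_len >= min_run
    extra = extra + run_len - 1;               % a run that extends to the end of the genome
  end
end

The plotted ratio is then roughly extra / sum(g1 == g2) for each genome and its nearest neighbor.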

Here’s the dataset:

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

All genomes are taken from the NIH, and each has a provenance file that links to the NIH Database.

Here’s the code:

https://www.dropbox.com/s/sz032x9py1vk3qf/Calc_Max_Match.m?dl=0

https://www.dropbox.com/s/npnz8ljzoh3sxzy/Max_Match_Count_CMNDLINE.m?dl=0

A Note on Alignment

I was working on a problem involving Heidelbergensis, and it dawned on me that local alignment and global alignment are fundamentally different problems. Specifically, if you want to find an optimum global alignment for mtDNA, you can shift the genome incrementally, and compare it to some reference genome, until you maximize the number of matching bases. If you do this locally, the arguably correct way to do it is to treat every base index as a potential insertion or deletion. This is intractable, despite the fact that mtDNA is finite, and so it is obviously not a sensible way to attack the problem. In fact, because you’re definitely going to get non-linear changes in match count as a function of shifting, there’s no way that a generalized optimization algorithm will solve this problem. This implies that as a general matter, global alignments are the correct way to align mtDNA.
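
A rough count makes the gap between the two problems concrete: the incremental shifting above considers on the order of N candidate global alignments, where N is the genome size (roughly 16,569 bases for mtDNA), whereas treating every base index as a potential insertion or deletion gives at least 2 choices per index, and therefore at least 2^N candidate local alignments, which is why the local problem is intractable even though the molecule itself is finite.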

Heidelbergensis as Ancestor

I discovered an ancient Chinese genome in the NIH Database that implies that the Iberian Romani predate Heidelbergensis. The reasoning is straightforward, and impossible to argue with. Specifically, if mtDNA genome A is the ancestor of both genomes B and C, then it is almost certainly the case (as a matter of probability) that genomes A and B, and A and C, have more bases in common than genomes B and C. That is, A has more in common with both B and C, than B and C have in common with each other. This follows from basic probability, which you can read about in Section 6 of my paper, A New Model of Computational Genomics. The intuition is simple: if you fix a set of bases in genomes B and C (i.e., those inherited from ancestor genome A), then genomes B and C are almost certainly going to diverge from that set as they mutate over time, rather than randomly develop new bases in common by chance. As a consequence, assuming they both descend from A, they should not develop even more bases in common as a function of time.
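
As a minimal sketch of that criterion (assuming three globally aligned, equal-length character vectors gA, gB, and gC):

function ancestor = likely_ancestor(gA, gB, gC)
  % Picks the genome whose exclusion leaves the lowest-matching pair, per the
  % criterion in Section 6: the ancestor should have more in common with each
  % of the other two genomes than they have with each other.
  mAB = sum(gA == gB);
  mAC = sum(gA == gC);
  mBC = sum(gB == gC);
  if mAB >= mBC && mAC >= mBC
    ancestor = 'A';                            % B and C have the least in common
  elseif mAB >= mAC && mBC >= mAC
    ancestor = 'B';                            % A and C have the least in common
  else
    ancestor = 'C';                            % A and B have the least in common
  end
end

With the match percentages reported below, B and C have the least in common, so the function would select A, the Iberian Romani genome.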

In this particular case, fixing genome A as Iberian Romani, genome B as Heidelbergensis, and genome C as the ancient Chinese genome, we find that A and B have 97% of bases in common, A and C have 65% of bases in common, and B and C have 63% of bases in common. As a consequence, the most likely arrangement is that A (the Iberian Romani) is the ancestor of both B and C. This doesn’t imply that it is the case, but it is the most likely case, since assuming Heidelbergensis is the ancestor of the other two requires assuming that the Iberian Roma and the ancient Chinese genome spontaneously developed 331 additional bases (i.e., 2% of the full genome) in common by chance, which is extremely unlikely. If the Romani in fact predate Heidelbergensis, they would almost certainly be the most ancient living humans. The fact that they are a 96% match to Heidelbergensis is alone compelling evidence for this claim.

Moreover, even if you account for local alignment, you end up with the Iberian Roma and Heidelbergensis equally likely to be the ancestor of the other. Specifically, assuming you’ve maximized the global alignment (i.e., shifted the genomes as a whole to maximize the percentage of matching bases), the best you can do after that is to account for local insertions and deletions. These will appear in the gaps between matching bases. It turns out, even if you make the best case assumption, which is that a shift by 1 in a gap of length M will produce M-1 matching bases, you still end up with A and B having 99.93% of bases in common, A and C having 99.83% of bases in common, and B and C having 99.83% of their bases in common. This implies that both the Iberian Roma and Heidelbergensis are equally likely to be the common ancestor genome. Note that this is arguably bad practice, because it assumes a large number of small shifts that could of course be the result of chance. The bottom line conclusion is that the Iberian Roma are a seriously ancient people. The Iberian Roma are also a 99% match for the Papuans in Papua New Guinea.

An interesting observation during this process: if you consider only gaps of appreciable length, you barely move the match count. I’ll test this tomorrow, but it suggests that once you fix the global alignment, the local alignments that are statistically meaningful (i.e., too long to be the credible result of chance) don’t add anything material to the match count, even under the best case assumption of simply assuming the entire gap would match if shifted by 1. It also suggests, again, that the Roma are the most likely ancestor of the three genomes, since considering only long gaps (i.e., at least 10 bases long) barely changed the match counts and didn’t change their ordinal relationships.

Here’s the dataset. All of the genomes are taken from the NIH, and have provenance files with links to the NIH Database.

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

Computing Ancestry

I presented an algorithm that builds a graph showing possible ancestral connections among genomes, which you can find in Section 6 of my paper, A New Model of Computational Genomics. The basic idea is that given genomes A, B, and C, if genome A is the ancestor genome of both genomes B and C, then it is almost certainly the case (as a matter of probability) that genomes A and B, and A and C, will each have more in common than genomes B and C have with each other. This is for the simple reason that it is far more likely that both genomes B and C will mutate away from genome A, divergently, than it is that both B and C will somehow spontaneously develop common bases.

For exactly the same reason, if two genomes A and B have more than 25% of their bases in common (i.e., the match rate expected by chance), but less than 100% of their bases in common, then they almost certainly have an ancestral connection. Specifically, there are exactly three possibilities: (i) genomes A and B have a common ancestor; (ii) genome A is the ancestor of genome B; (iii) genome B is the ancestor of genome A. You can’t say which is the case, but the point is, there must be an ancestral relationship, as a consequence of basic probability. This becomes more compelling as the percentage increases above 25%, and decreases below 100%, and quickly becomes basically impossible to argue with in both cases.

As such, the attached code sets a window within which two genomes are treated as a match, with the minimum match set to 70%, and the maximum match set to 96%. I came up with these numbers because a significant portion of the global population is a 70% match with Denisovan mtDNA, and a large portion of the global population is a 96% match with Heidelbergensis, suggesting that if an ancestral relationship exists over even an enormous amount of time (i.e., hundreds of thousands of years), you shouldn’t be much further off than that.
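
Here is a minimal sketch of that window test. This is not the attached Ancestral_Connections_CMDNLINE.m; it assumes, for illustration only, that the dataset is an N x L character matrix called genomes of globally aligned genomes:

min_match = 0.70;                              % minimum match percentage
max_match = 0.96;                              % maximum match percentage
[N, L] = size(genomes);
connected = false(N, N);
for i = 1 : N
  for j = i + 1 : N
    pct = sum(genomes(i, :) == genomes(j, :)) / L;
    if pct >= min_match && pct <= max_match
      connected(i, j) = true;                  % treat as a possible ancestral connection
      connected(j, i) = true;
    end
  end
end

Tallying the entries of connected by population then gives a distribution like the one shown below.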

Specifically, 100% of both the Iberian Roma and Papuans (i.e., from Papua New Guinea) in the dataset below are a 96% match with Heidelbergensis. As a consequence, they must be truly ancient people, since Heidelbergensis is believed to have gone extinct hundreds of thousands of years ago. They must be a mutation off of Heidelbergensis, or even more interestingly, they possibly predate or have a common ancestor with Heidelbergensis. Therefore, in every case, the Romani and Papuans must be hundreds of thousands of years old; it simply must be true, or we’re wrong about when Heidelbergensis went extinct.

That follows from basic probability (again see Section 6 of the paper above), but what’s really interesting, using the algorithm below, is that it seems a lot of people have an ancestral connection to the Phoenicians, including the Scandinavians. This is something I hypothesized a long time ago, because they’re both ship-building peoples that lived in city-states, and also because some ancient Runes (i.e., the Viking alphabet) appear to be Semitic. They also seem to have gods in common, specifically Adon and Odin, and their sons Baal and Baldr, Canaanite and Norse, respectively. Here’s the distribution of potential ancestral relationships for the Norwegian genomes in the dataset, and you’ll note the plain connection to the Phoenicians (acronym PH), who are in turn also closely related to the Sardinians (acronym SR). Note that the dataset has been diligenced to ensure that, e.g., a Norwegian genome is collected from an ethnically Norwegian person, as opposed to a person located in Norway. All genomes are taken from the NIH Database, and the dataset is therefore courtesy of the NIH.

The height of each column shows the percentage of the maximum possible number of matching genomes for each population.

Here’s the command line code; any subroutines can be found in A New Model of Computational Genomics, together with the dataset itself:

https://www.dropbox.com/s/w1m2j5lsvj232ku/Ancestral_Connections_CMDNLINE.m?dl=0