Random Versus Sequential Imputation

Yesterday I presented even more evidence that you get stronger imputation using random bases from a genome, as opposed to sequential bases. I already presented some evidence for this claim in my paper A New Model of Computational Genomics (see Section 7) [1]. Specifically, I showed in [1] that when calculating the nearest neighbor of a partial genome, you are more likely to map to the true nearest neighbor if you use random bases rather than sequential bases, which suggests that imputation is stronger when using a random set of bases. The purpose of Section 7 was only to show that using random bases is at least workable, because the model presented in [1] is predicated upon the assumption that you don't need to look for genes or haplogroups to achieve imputation. As a result, I didn't much care whether random bases were strictly superior, though it now seems that they are.
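
To make the nearest-neighbor claim concrete, here's a minimal Octave sketch (my own illustration, not the code from [1]; the function name and variables are hypothetical). It assumes the genomes are stored as an N x L matrix of integer-coded bases, and finds the nearest neighbor of genome i measured only at the positions in idx:

function j = partial_nearest_neighbor(genomes, i, idx)
  % Count matching bases at the selected positions for every genome.
  matches = sum(genomes(:, idx) == genomes(i, idx), 2);
  matches(i) = -1;           % exclude the query genome itself
  [~, j] = max(matches);     % most similar genome over the positions in idx
end

The claim from [1] is then that when idx is a random set of positions, j is more likely to coincide with the nearest neighbor computed over all L positions than when idx is a contiguous block.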

Specifically, build a cluster for a partial genome A(x), where x is a set of indexes, by including another genome B in the cluster if and only if A(x) = B(x). You then find that the average total number of matching bases between the full genome A and all such genomes B is greater when x is a random set of indexes than when it is a sequential set. I tested this for both random and sequential indexes, beginning with partial genomes of 1,000 bases (i.e., x starts out with 1,000 indexes), incrementing by 2,500 bases each iteration, and terminating at the full genome size of 16,579 bases, building clusters for each of the 405 genomes in the dataset at every iteration. The random indexes are strictly superior, in that for every one of the 405 genomes, the average match count between that genome and the genomes in its cluster is higher when using random indexes than when using sequential indexes. Note that the sequential indexes have a random starting point, so this is not the result of an idiosyncratic portion of the genome. A sketch of the clustering procedure follows.
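
Here's a minimal Octave sketch of one iteration of that experiment (a reconstruction for illustration, not the linked code; the function names and variables are hypothetical). It assumes the genomes are stored as an N x L matrix of integer-coded bases, and compares the average full-genome match count for clusters built from k random positions versus k sequential positions with a random starting point:

function [avg_rand, avg_seq] = compare_clusters(genomes, k)
  [N, L] = size(genomes);              % e.g., N = 405, L = 16579
  rand_idx = randperm(L, k);           % k random base positions
  s = randi(L - k + 1);                % random start for the sequential block
  seq_idx = s : (s + k - 1);           % k contiguous base positions
  avg_rand = mean(arrayfun(@(i) cluster_match(genomes, i, rand_idx), 1:N));
  avg_seq  = mean(arrayfun(@(i) cluster_match(genomes, i, seq_idx), 1:N));
end

function m = cluster_match(genomes, i, idx)
  % Cluster: all genomes B with B(idx) = A(idx), where A is genome i.
  in_cluster = all(genomes(:, idx) == genomes(i, idx), 2);
  in_cluster(i) = false;               % exclude genome i itself
  if any(in_cluster)
    % Average number of matching bases over the FULL genome between
    % genome i and each member of its cluster.
    m = mean(sum(genomes(in_cluster, :) == genomes(i, :), 2));
  else
    m = 0;                             % empty cluster
  end
end

Note that the cluster is built from agreement at the selected positions only, while the match count is taken over the full genome, which is what makes it a measure of imputation strength.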

This might seem surprising, since so much of genetics is predicated upon genes and haplogroups, but it makes perfect sense: proteins are encoded by codons, i.e., sequences of 3 bases. As a consequence, if you concentrate the selected bases in a contiguous sequence, you create overlap, since once you fix 1 base, the following 2 bases will likely be partially determined. Therefore, you maximize imputation by spreading the selected bases over the entire genome. Could there be an optimal distribution that is neither random nor sequential? Perhaps, but the point is that random is not only good enough, but better than sequential, and therefore the model presented in [1] makes perfect sense.

Here’s the dataset and the code:

https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0

https://www.dropbox.com/s/9itnwc1ey92bg4o/Seq_versu_Random_Clusters_CMDNLINE.m?dl=0

