In a paper I wrote entitled A New Model of Computational Genomics [1], I presented a simple test for ancestry that is impossible to argue with. Let |AB| denote the number of matching bases between two genomes A and B. Given genomes A, B, and C, if we assume that genome A is the common ancestor of genomes B and C, then it is almost certainly the case (see [1] for a discussion of the probabilities) that |AB| > |BC| and |AC| > |BC|. That is, genomes A and B, and A and C, almost certainly have more bases in common than genomes B and C. For intuition, beginning with genome A, and assuming independent mutations away from A to genomes B and C, this is like tossing two independent coins (i.e., the mutations within genomes B and C that deviate from A), which should not have more than chance in common. As such, B and C should deviate away from each other at a faster rate than they deviate from A individually.
Now this is already really powerful, and led me to a complete history of mankind, which is more than plausible. But that said, it’s a necessary condition, not a sufficient condition. That is, if genome A is the common ancestor of genomes B and C, then the inequalities above almost certainly hold, but it’s subject to falsification (i.e., it’s not a sufficient condition). I realized tonight, you can actually transform this into a necessary and sufficient condition.
Specifically, the inequality above can be represented as a graph where A is connected to B, A is connected to C, and B is connected to C, with the match counts labelling the edges of the graph. For example, the edge connected A to B would be labeled with |AB|, which will be some integer. If the inequalities are satisfied, only two such graphs out of six are plausible, for the same reasons that underly the inequality. Specially, if I assume A is the ancestor of B, which is in turn the ancestor of C, then A and C almost certainly have fewer bases in common than A and B.
The graphs that remain, imply that if the inequality is satisfied, then A is almost certainly the ancestor of either B or C, or both, as a necessary and sufficient condition. If we plug in an implausible genome for either B or C (e.g., assuming that the Norwegians = A are the ancestors of Heidelbergensis = B), then the inequality serves as a necessary and sufficient condition for the descendants of the Norwegians, i.e., genome C. I will write more about this tomorrow, including code and some testing.
UPDATE 10/19/25
I’ve implemented a new version of the ancestry algorithm, which so far seems to work. Code is attached below, more to come!
Discover more from Information Overload
Subscribe to get the latest posts sent to your email.