Determining Order without Entropy

I’m working on another ancestry algorithm, and the premise is really simple: you run nearest neighbor on every genome in a dataset. The nearest neighbors produce a graph, with every genome connected to its nearest neighbor by an edge. Because reality seems to be continuous as a function of time, small changes in time should produce small changes in genomes. Because nearest neighbor finds the smallest difference between a given genome and any other genome in the dataset, it follows that if genome A is the nearest neighbor of B, then A and B are most proximate in time, at least limited to the dataset in question. However, it’s not clear whether A runs to B, or B runs to A. And this is true even given a sequence of nearest neighbors, ABC, which could be read either forwards or backwards. That is, all we know is that the genomes A, B, and C are nearest neighbors in that order (i.e., B is the nearest neighbor of A, and C is the nearest neighbor of B).
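As a rough illustration of the construction above, here is a minimal Python sketch. It assumes the genomes are equal-length strings that are already aligned, and that "nearest" means "most matching bases at the same positions". The function names (`matching_bases`, `nearest_neighbor_edges`) and the toy genomes are made up for this example, and ties are broken by taking the lowest index, which is just one arbitrary choice.

```python
def matching_bases(a, b):
    """Count positions where two equal-length genomes carry the same base."""
    return sum(x == y for x, y in zip(a, b))


def nearest_neighbor_edges(genomes):
    """Return a list where entry i is the index of genome i's nearest neighbor.

    Assumption: similarity is the number of matching bases, and ties go to
    the candidate with the lowest index.
    """
    edges = []
    for i, g in enumerate(genomes):
        best_j, best_score = None, -1
        for j, h in enumerate(genomes):
            if i == j:
                continue  # a genome is never its own neighbor
            score = matching_bases(g, h)
            if score > best_score:
                best_j, best_score = j, score
        edges.append(best_j)
    return edges


# Toy data: each genome differs from the previous one by a single base.
genomes = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "AAAAATTT"]
edges = nearest_neighbor_edges(genomes)
print(edges)
```

Each entry of `edges` is a directed edge from genome i to its nearest neighbor, but as noted above, the direction of the edge says nothing yet about the direction of time.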

This is something I came up with a long time ago, using images. Specifically, if you take a set of images from a movie, and remove the order information, you can still construct realistic looking sequences by just using nearest neighbor as described above. This is because reality is continuous, and so images that are played in sequence, where frame i is very similar to frame i + 1, create convincing looking video, even if it’s the wrong order, or the sequence never really happened at all. I’m pretty sure this is what the supposedly “generative A.I.” algorithms do, and, frankly, I think they stole it from me, since this idea is years old at this point.

However, observing a set of images running backwards will eventually start to look weird, because people will walk backwards, smoke will move the wrong direction, etc., providing visual cues that what you’re looking at isn’t real. This intuitive check is not there with genomes, and so, it’s not obvious how to determine whether the graph generated using nearest neighbor is forwards or backwards in time.

This led me to an interesting observation, which is that there’s an abstract principle at work: the present should have increasingly less in common with the future. Unfortunately, this is true backwards or forwards, but it does add an additional test that allows us to say whether or not a sequence of genomes is in some sensible, temporal order. Specifically, using ABC again, A and B should have more bases in common than A and C, and this should continue down the sequence. That is, if we had a longer sequence of N genomes, then genome 1 should have less and less in common with genome i, as i increases.
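The test described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: genomes are equal-length, aligned strings, "in common" means matching bases at the same positions, and the check allows ties (non-increasing rather than strictly decreasing). The names `overlap_with_first` and `is_plausible_temporal_order` are hypothetical.

```python
def matching_bases(a, b):
    """Count positions where two equal-length genomes carry the same base."""
    return sum(x == y for x, y in zip(a, b))


def overlap_with_first(sequence):
    """Matching-base counts between the first genome and each later genome."""
    first = sequence[0]
    return [matching_bases(first, g) for g in sequence[1:]]


def is_plausible_temporal_order(sequence):
    """True if the first genome shares no more bases with genome i + 1
    than it does with genome i, all the way down the sequence."""
    overlaps = overlap_with_first(sequence)
    return all(earlier >= later for earlier, later in zip(overlaps, overlaps[1:]))


chain = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "AAAAATTT"]
shuffled = ["AAAAAAAA", "AAAAATTT", "AAAAAAAT", "AAAAAATT"]

print(is_plausible_temporal_order(chain))        # True
print(is_plausible_temporal_order(shuffled))     # False
print(is_plausible_temporal_order(chain[::-1]))  # True
```

Note that the reversed chain also passes, which is the point made above: this test tells you whether a sequence is a sensible temporal ordering, not which end is the start.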

For datasets generally, we still can’t say whether the sequence is backwards or forwards, but we can say whether the sequence is a realistic temporal embedding, and if so, we will know that it yields useful information about order. However, because mutation is random, the specific case of genomes is different. If it turns out that A and B contain more bases in common than A and C, then that can’t realistically be read backwards from C to A, since that would imply that C randomly mutated to have more bases in common with A, which is not realistic for any appreciable number of bases. This is analogous to smoke running backwards, which just doesn’t happen. However, because of natural selection, we can’t be confident that entropy is increasing from A to C. In fact, if genomes became noisier over time, everyone would probably die. Instead, life gets larger, and more complex, suggesting entropy is actually decreasing. That said, we know that the laws of probability imply that the number of matching bases must decrease between A and all other genomes down the chain, if A is in fact the ancestor of all genomes in the chain. But this is distinct from entropy increasing from A onwards. So the general point remains: you can determine order without entropy.

UPDATE: It just dawned on me that if you have a particle that has uniform, random motion, then its expected change in position is zero. As a result, if you repeatedly observe that particle moving from a single origin point (i.e., its motion always commences from the same origin), its net motion over time will be roughly zero, but you’ll still have some motion in general. If a given point constantly ends up as a terminus in the nearest neighbor construction above, then it’s probably the origin. The reasoning here is that it’s basically impossible for the same point to repeatedly appear as a terminus, unless it’s the origin. I think this implies that a genome with a high in-degree in the nearest neighbor graph above is a common ancestor, and not a descendant.
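To make the in-degree idea concrete, here is a small Python sketch. It assumes the nearest neighbor graph is stored as a list `edges`, where `edges[i]` is the index of genome i's nearest neighbor. That representation, the function names, and the example graph are all invented for illustration, and the result is only a heuristic guess at the origin, in line with the reasoning above.

```python
from collections import Counter


def in_degrees(edges):
    """Count how many genomes point at each genome as their nearest neighbor."""
    return Counter(edges)


def candidate_origins(edges):
    """Return the genome indices with the highest in-degree.

    Under the heuristic above, these are the most likely common ancestors.
    """
    counts = in_degrees(edges)
    top = max(counts.values())
    return sorted(node for node, count in counts.items() if count == top)


# Hypothetical graph: genomes 1 through 4 all have genome 0 as their
# nearest neighbor, and genome 0's own nearest neighbor is genome 1.
edges = [1, 0, 0, 0, 0]

print(in_degrees(edges)[0])      # 4
print(candidate_origins(edges))  # [0]
```

Here genome 0 keeps showing up as a terminus, so it would be flagged as the probable origin.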

