Spatial Uncertainty and Order

I presented a measure of spatial uncertainty in my paper, Sorting, Information, and Recursion [1], specifically equation (1). I proved in [1] that equation (1) is maximized when all of its arguments are equal (see Theorem 3.2 of [1]). This is really interesting, because the same is true of the Shannon entropy, which is maximized when all probabilities are equal. They are not the same equation, but they're similar, and both are rooted in the logarithm. However, my equation takes real-number lengths or vectors as inputs, whereas Shannon's equation takes probabilities as inputs.
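Equation (1) itself is defined in [1] and not reproduced here, but the corresponding property of the Shannon entropy is easy to check numerically. The snippet below (plain Python, illustrative only, not code from [1]) simply confirms that H(p) = -Σ p_i log2(p_i) is largest when all probabilities are equal:

```python
import math

def shannon_entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p (probabilities summing to 1)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# A uniform distribution over 4 outcomes versus two skewed ones.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits -- the maximum for 4 outcomes
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.357 bits
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.242 bits
```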

I just realized that Theorem 3.2 in [1] implies the astonishing result that the order of a set of observations impacts the uncertainty associated with those observations. That is, we're used to taking a set of observations and ignoring the ordinal aspect of the data, unless it's explicitly a time series. Instead, Theorem 3.2 implies that the order in which the data was generated is always relevant to the uncertainty associated with the data.

This sounds crazy, but I've already shown empirically that these types of results in information theory work out in the real world; see Information, Knowledge, and Uncertainty [2]. The results in [2] allow us to take a set of classification predictions and assign confidence values to them that are empirically correct, in the sense that accuracy increases as a function of confidence. The extension here is that spatial uncertainty is also governed by an entropy-type equation, specifically equation (1), which is order-dependent. We could test this empirically by measuring whether prediction error is actually impacted by order in an amount greater than chance. That is, we filter predictions as a function of spatial uncertainty, and test whether prediction accuracy improves as we decrease uncertainty.
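As a concrete sketch of that test (illustrative only, not code from [1] or [2]), suppose we have, for each prediction, a spatial uncertainty score computed from equation (1) and a flag indicating whether the prediction was correct. Sweeping an uncertainty cutoff and measuring the accuracy of the surviving predictions would look roughly like this:

```python
import numpy as np

def accuracy_by_uncertainty(uncertainty, correct, num_thresholds=10):
    """For a range of uncertainty cutoffs, report the accuracy of the predictions
    whose uncertainty falls at or below each cutoff.

    uncertainty : per-prediction spatial uncertainty values (e.g., from equation (1))
    correct     : boolean array, True where the prediction was right
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    cutoffs = np.quantile(uncertainty, np.linspace(0.1, 1.0, num_thresholds))
    results = []
    for c in cutoffs:
        mask = uncertainty <= c
        if mask.any():
            results.append((c, int(mask.sum()), float(correct[mask].mean())))
    return results  # list of (cutoff, number retained, accuracy)

# If the hypothesis holds, accuracy should rise as the cutoff falls, by more than
# what randomly reshuffling the uncertainty values would produce (a permutation test).
```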

Perhaps most interesting, because equation (1) is order-dependent, if we have an observed uncertainty for a dataset (e.g., implied from prediction error), and we for whatever reason do not know the order in which the observations were made, we can set equation (1) equal to that observed uncertainty and solve for the potential orderings that produce values approximately equal to it. This would allow us to take a set of observations for which the order is unknown and limit the space of possible orderings, given a known uncertainty, which again can be implied from known error. This could allow for inferences regarding order that exceed a given sample rate. That is, if our sample rate is slower than the movement of the system we're observing, we might be able to restrict the set of possible states of the system using equation (1), thereby effectively improving our sample rate in that regard. Said otherwise, equation (1) could allow us to know something about the behavior of a system between the moments we're able to observe it. Given that humanity already has sensors and cameras with very high sample rates, this could push things even further, giving us visibility into previously inaccessible fractions of time, perhaps illuminating the fundamental unit of time.
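A brute-force version of that search is easy to sketch, assuming equation (1) is available as a function (a placeholder below, since the actual definition is in [1]): enumerate candidate orderings of the observations, evaluate equation (1) on each, and keep the orderings whose value is within some tolerance of the observed uncertainty. This is only feasible for small sets of observations, but it makes the idea concrete:

```python
from itertools import permutations

def candidate_orderings(observations, observed_uncertainty, equation_1, tolerance=1e-3):
    """Return the orderings of `observations` whose equation (1) value is within
    `tolerance` of `observed_uncertainty`.

    `equation_1` is a placeholder for the order-dependent uncertainty measure from [1];
    brute force over permutations only works for small numbers of observations.
    """
    matches = []
    for ordering in permutations(observations):
        if abs(equation_1(ordering) - observed_uncertainty) <= tolerance:
            matches.append(ordering)
    return matches
```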

The Rate of Mutation of mtDNA

I’ve written in the past on the topic of the rate of mutation of mtDNA, in an attempt to calculate the age of mankind. It turns out there really isn’t a good single answer to the rate at which human mtDNA mutates, and as a result, you can’t come to any clear answer using mtDNA alone. In fact, I realized the other day that the rate seems to vary by maternal line. Specifically, some modern humans carry archaic mtDNA, in particular Heidelbergensis, Denisovan, and Neanderthal mtDNA. Other modern humans carry mtDNA that is basically identical to ancient mtDNA (e.g., 4,000 years old), but not archaic mtDNA (e.g., 100,000 years old). In particular, many modern humans globally carry Ancient Egyptian mtDNA, from about 4,000 years ago.

You can get an idea of the rate of mutation by taking, e.g., a modern human that has Denisovan mtDNA, comparing it to a bona fide archaic Denisovan genome, counting the number of changed bases, and then dividing by the number of years since the archaic sample lived, which produces a measure of the number of changed bases per year. This can of course be expressed as a percentage of the total genome size, which is what I’ve done below.
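As a rough sketch of that calculation (assuming the two mtDNA sequences are already aligned and stored as equal-length strings; this is illustrative, not the code from the paper):

```python
def percent_change(seq_a, seq_b):
    """Percentage of positions at which two aligned, equal-length sequences differ."""
    assert len(seq_a) == len(seq_b)
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * diffs / len(seq_a)

def change_per_1000_years(seq_modern, seq_archaic, years_before_present):
    """Total percentage change divided by elapsed time, expressed per 1,000 years."""
    return percent_change(seq_modern, seq_archaic) / (years_before_present / 1000.0)

# e.g., a 26.39% total change over 120,000 years works out to roughly 0.22% per 1,000 years.
```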

We can be a bit fancier about it by comparing a given genome to many others, producing a distribution of the number of changed bases per year. The code below does exactly this, producing the average total percentage change, the minimum total change, the maximum total change, and the standard deviation over all total changes. The comparison was made only to modern genomes, so we can take the known (and plainly approximate) date of the archaic / ancient genome and divide the total change by the number of years from that date to the present. This produces a rate of change per year, which I’ve expressed as a rate of change per 1,000 years.
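The full code appears at the end of this post; as a rough, self-contained sketch of the summary step (mine, illustrative only), given the list of total percentage changes produced by comparing one archaic / ancient genome to each modern genome:

```python
import numpy as np

def summarize_changes(percent_changes, years_before_present):
    """Summarize the distribution of total percentage changes obtained by comparing one
    archaic / ancient genome to each modern genome (e.g., via a percent_change helper
    like the one sketched above)."""
    changes = np.asarray(percent_changes, dtype=float)
    return {
        "avg_change": changes.mean(),
        "min_change": changes.min(),
        "max_change": changes.max(),
        "std_dev": changes.std(),
        "avg_change_per_1000_years": changes.mean() / (years_before_present / 1000.0),
    }

# e.g., summarize_changes(changes_vs_each_modern_genome, years_before_present=80000)
```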

The results are as follows:

Genome Type         Avg. Change   Min. Change   Max. Change   Std. Deviation   Genome Date   Avg. Change Per 1,000 Years
Denisovan           26.39%        25.76%        32.70%        1.99%            120,000 BP    0.22%
Neanderthal         3.74%         2.79%         36.60%        3.27%            80,000 BP     0.047%
Heidelbergensis     4.27%         3.30%         37.61%        3.30%            430,000 BP    0.0099%
Ancient Egyptian    3.74%         0.17%         35.23%        8.32%            4,000 BP      0.935%

Again, note that Denisovan, Neanderthal, and Heidelbergensis are all archaic humans. In contrast, the Ancient Egyptians are of course ancient, but not archaic. The dataset contains 664 rows, 76 of which are archaic or ancient, which leaves 588 rows for the comparisons produced above. As a result, even though the table above was produced using only 4 input genomes, the results were generated by comparing each of the 4 input genomes to all 588 complete, modern human mtDNA genomes in the dataset. The plain implication is that modern human mtDNA is evolving faster than archaic human mtDNA, since, e.g., the Ancient Egyptian genome has an average total rate of change equal to that of the Neanderthal genome, despite having only 4,000 years to achieve this total change, in contrast to the roughly 80,000 years that have passed since the Neanderthal genome. Technically, we should only be testing genomes we believe to be descended from the archaic / ancient genomes, since, e.g., it is theoretically possible that a modern person has mtDNA that predates the Ancient Egyptian genome, given how stable mtDNA is. That said, the bottom line is that this is a measure of the variability of a particular maternal line, and the amount of mutation cannot exceed that variability. For this and other reasons, more studies are required, but this is an interesting observation.

The code is below; the balance of the code can be found in my paper, A New Model of Computational Genomics.