Note on Ramsey’s Theorem

It’s always bothered me that Ramsey’s Theorem is not probabilistic. For example, R(3,3), i.e., the smallest number of vertices for which every graph is guaranteed to contain either a complete graph on 3 vertices or an empty graph on 3 vertices, is 6. This means that literally every graph with 6 or more vertices contains either a complete graph on 3 vertices, or an empty graph on 3 vertices. This is not probabilistic, because it’s simply true, for all graphs on 6 or more vertices. But it just dawned on me that you can construct a probabilistic view of this fact: on fewer than 6 vertices, the probability that a graph contains one of these structures is less than one, whereas with 6 or more vertices, the probability is exactly 1. This is true in the literal sense, since not every graph on fewer than 6 vertices contains a complete graph on 3 vertices or an empty graph on 3 vertices, but some do. I think this could actually be quite deep, and connect to random graphs, but I need some time to think about it.
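
To make the probabilistic view concrete, here is a minimal brute-force sketch in Python (the function names are my own, just for illustration) that enumerates every labeled graph on n vertices and computes the fraction containing either a triangle or an empty graph on 3 vertices. The fraction should be strictly below 1 for n < 6 and exactly 1 for n = 6.

    from itertools import combinations, product

    def has_triangle_or_empty_triple(n, edges):
        # Check every 3-vertex subset: either all three edges present (a triangle)
        # or all three edges absent (an empty graph on 3 vertices).
        for triple in combinations(range(n), 3):
            present = sum(1 for p in combinations(triple, 2) if p in edges)
            if present == 3 or present == 0:
                return True
        return False

    def fraction_with_ramsey_structure(n):
        # Enumerate all 2^C(n,2) labeled graphs on n vertices.
        all_pairs = list(combinations(range(n), 2))
        hits, total = 0, 0
        for bits in product([0, 1], repeat=len(all_pairs)):
            edges = {p for p, b in zip(all_pairs, bits) if b}
            total += 1
            if has_triangle_or_empty_triple(n, edges):
                hits += 1
        return hits / total

    for n in range(3, 7):
        print(n, fraction_with_ramsey_structure(n))
    # Expected: fractions below 1 for n = 3, 4, 5, and exactly 1.0 for n = 6.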

Another thought, which I think I’ve expressed before: if we can analogize Ramsey’s Theorem to time, then it would imply that certain structures eventually become permanent. This is a truly strange idea, and though I’m just brainstorming, intuitively, it doesn’t sound wrong. And now that I’ve thought a bit more about it, I’ve definitely had this idea before:

Specifically, correlation between two random variables can be thought of as an edge between two vertices, where the vertices represent the variables, and the presence or absence of an edge represents the presence or absence of correlation. If we consider all random variables together, then it’s clear that having no correlation at all would correspond to an empty graph, and correlation between all variables would correspond to a complete graph. If all graphs are equally likely, then no correlation and total correlation would be equally likely, and in fact the least likely possibilities for any set of more than two variables, when compared to the category of at least some but less than total correlation. As a result, if we randomly select random variables, we should generally find at least some correlation, regardless of their nature or apparent relationships.
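
As a quick sanity check on the counting argument, here is a short Python sketch (again, the function name is my own, for illustration) that computes, for n variables under a uniform distribution over labeled graphs, the probability of the empty graph, the complete graph, and everything in between.

    from math import comb

    def correlation_graph_probabilities(n):
        # Number of labeled graphs on n vertices: 2^C(n,2).
        total = 2 ** comb(n, 2)
        p_empty = 1 / total        # no correlation at all
        p_complete = 1 / total     # correlation between every pair
        p_partial = 1 - 2 / total  # some, but not total, correlation
        return p_empty, p_complete, p_partial

    for n in range(2, 7):
        print(n, correlation_graph_probabilities(n))
    # For n > 2, the empty and complete graphs are each a single outcome out of
    # 2^C(n,2) graphs, so "some but not total correlation" dominates overwhelmingly.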

If we imagine time quantized on a line, with a vertex representing a moment, and allow one moment in time to be related to another moment in time by connecting them with an edge, we will have a graph that just happens to be visualized along a line. Applying Ramsey’s Theorem, we know that certain structures must emerge over time, since we are allowing for the possibility of ever larger graphs. At the same time, the correlation argument above implies that each moment should have some possibly non-causal connection to other moments, producing non-empty graphs. That is, if one moment is connected to another in the remote past, it’s really not credible that the connection is causal, and it is instead an artifact of this line of thinking. This argument as a whole implies the possibility that reality has non-causal relationships over time, regardless of whether the past, present, or future is memorialized in any way, and regardless of whether the past, present, or future is physically real, because these are immutable, abstract arguments. All of that said, this is a lot to think about, and I need to organize it a bit more, but the core idea seems sound, and that’s disturbing.

Spatial Uncertainty and Order

I presented a measure of spatial uncertainty in my paper, Sorting, Information, and Recursion [1], specifically, equation (1). I proved a theorem in [1] that equation (1) is maximized when all of its arguments are equal. See Theorem 3.2 of [1]. This is really interesting, because the same is true of the Shannon Entropy, which is maximized when all probabilities are equal. They are not the same equation, but they’re similar, and both are rooted in the logarithm. However, my equation takes real-number lengths or vectors as inputs, whereas Shannon’s equation takes probabilities as inputs.
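
For comparison (equation (1) itself is not reproduced in this note), the parallel statement for the Shannon Entropy is the standard maximization result:

    H(p_1, ..., p_n) = -\sum_{i=1}^{n} p_i \log p_i \le \log n,

with equality if and only if p_1 = p_2 = ... = p_n = 1/n, i.e., exactly when all probabilities are equal, mirroring the way Theorem 3.2 says equation (1) is maximized when all of its arguments are equal.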

I just realized that Theorem 3.2 in [1] implies the astonishing result that the order of a set of observations impacts the uncertainty associated with those observations. That is, we’re used to taking a set of observations and ignoring the ordinal aspect of the data, unless it’s explicitly a time series. Instead, Theorem 3.2 implies that the order in which the data was generated is always relevant to the uncertainty associated with the data.

This sounds crazy, but I’ve already shown empirically that these types of results in information theory work out in the real world. See Information, Knowledge, and Uncertainty [2]. The results in [2] allow us to take a set of classification predictions and assign a confidence value to them that is empirically correct, in the sense that accuracy increases as a function of confidence. The extension here is that spatial uncertainty is also governed by an entropy-type equation, specifically equation (1), which is order dependent. We could test this empirically by simply measuring whether or not prediction error is actually impacted by order, in an amount greater than chance. That is, we filter predictions as a function of spatial uncertainty, and test whether or not prediction accuracy improves as we decrease uncertainty.
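
A minimal sketch of that test, assuming we already have per-prediction spatial uncertainty scores from equation (1) (the scoring function itself is not shown here, and the synthetic data is purely illustrative), could look like the following:

    import numpy as np

    def accuracy_by_uncertainty_threshold(uncertainty, correct, num_thresholds=10):
        # uncertainty: per-prediction uncertainty scores (e.g., from equation (1))
        # correct: boolean array, True where the prediction was correct
        uncertainty = np.asarray(uncertainty, dtype=float)
        correct = np.asarray(correct, dtype=bool)
        thresholds = np.quantile(uncertainty, np.linspace(0.1, 1.0, num_thresholds))
        results = []
        for t in thresholds:
            mask = uncertainty <= t  # keep only predictions at or below this uncertainty
            if mask.any():
                results.append((t, int(mask.sum()), float(correct[mask].mean())))
        return results  # accuracy should rise as the uncertainty threshold falls

    # Synthetic example: lower uncertainty loosely tied to higher accuracy.
    rng = np.random.default_rng(0)
    u = rng.random(1000)
    c = rng.random(1000) > u * 0.5
    for t, kept, acc in accuracy_by_uncertainty_threshold(u, c):
        print(f"threshold={t:.2f} kept={kept} accuracy={acc:.3f}")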

Perhaps most interesting, because equation (1) is order dependent, if we have an observed uncertainty for a dataset (e.g., implied from prediction error), and we for whatever reason do not know the order in which the observations were made, we can set equation (1) equal to that observed uncertainty and solve for potential orderings that produce values approximately equal to it. This would allow us to take a set of observations for which the order is unknown and limit the space of possible orderings, given a known uncertainty, which can again be implied from known error. This could allow for implications regarding order that exceed a given sample rate. That is, if our sample rate is slower than the movement of the system we’re observing, we might be able to restrict the set of possible states of the system using equation (1), thereby effectively improving our sample rate in that regard. Said otherwise, equation (1) could allow us to know something about the behavior of a system between the moments we’re able to observe it. Given that humanity already has sensors and cameras with very high sample rates, this could push things even further, giving us visibility into previously inaccessible fractions of time, perhaps illuminating the fundamental unit of time.
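
Here is a brute-force sketch of that idea for small datasets. The function spatial_uncertainty below is a placeholder of my own (it is not equation (1)); the point is only the search structure: enumerate orderings and keep the ones whose uncertainty lands close to the observed value.

    from itertools import permutations
    import math

    def spatial_uncertainty(sequence):
        # Placeholder for equation (1): a logarithmic function of the gaps
        # between consecutive observations in the given order. Not the actual
        # equation from [1]; used only to illustrate the search.
        gaps = [abs(b - a) for a, b in zip(sequence, sequence[1:])]
        total = sum(gaps)
        if total == 0:
            return 0.0
        return sum(-(g / total) * math.log(g / total) for g in gaps if g > 0)

    def candidate_orderings(observations, observed_uncertainty, tolerance=0.05):
        # Keep every ordering whose uncertainty is within tolerance of the
        # observed (e.g., error-implied) uncertainty.
        matches = []
        for perm in permutations(observations):
            if abs(spatial_uncertainty(perm) - observed_uncertainty) <= tolerance:
                matches.append(perm)
        return matches

    obs = [0.1, 0.4, 0.5, 0.9]
    print(candidate_orderings(obs, observed_uncertainty=1.0, tolerance=0.1))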

Knowledge and Utility

I wrote a paper a while back called “Information, Knowledge, and Uncertainty” [1], which presents a mathematical theory of epistemology. I go on to apply it, showing that it can be used in machine learning to drastically improve the accuracy of predictions, using a measure of confidence that follows from the definitions in [1]. In another research note, whose name I don’t remember, I pointed out that we can also think about a different kind of information that is conveyed through a proof. Specifically, longer proofs correspond to more computational work, i.e., the work required to prove the theorem, which will have some number of deductive steps. Simply count the steps: the more steps there are, the more work is required to prove the result. Now of course, you could have a “bad” and pointlessly long proof for a theorem. So simply posit the existence of a shortest proof, as an analog of Kolmogorov Complexity. The number of steps in the shortest proof of a theorem is the depth of the theorem.

What caught my attention this morning is the potential connection between utility and the depth of a theorem. For example, the Pythagorean Theorem has very short proofs, and as a result, the shortest proof will necessarily also be short. Despite this, the Pythagorean Theorem is remarkably useful, and has undoubtedly been used relentlessly in architecture, art, and probably plenty of other areas of application. Now you could argue that there is no connection between depth and utility, but perhaps there is. And the reason I think there might be is that I show in [1] that the more Knowledge you have in a dataset, the more accurate the predictions are, implying that utility is a function of Knowledge, which has units of bits.

You can view the number of steps in a proof as computational work, which has units of changes in bits, which is different from bits, but is plainly a form of information. So the question becomes: is this something universal, in that when information is appropriately measured, utility becomes a function of information? If this is true, then results like the Graph Minor Theorem and the Four Color Theorem could have profound utility, since these theorems are monstrously deep results. If you’re a cartographer or someone who designs flags, then the Four Color Theorem is already useful, but jokes aside, the point is that there is at least the potential for profound utilization of what are currently only theoretical results.

As a self-congratulatory example, I proved a mathematical equivalence between sorting a list of real numbers and the Nearest Neighbor method [2]. The proof is about one page, and I don’t think you can get much shorter than what’s there. But the point, in the context of this note, is that the utility is unreal, in that machine learning is reduced to sorting a list of numbers (there’s another paper that proves Nearest Neighbor can produce perfect accuracy).
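
To illustrate the flavor of that reduction (this is my own toy example, not the construction from [2]): once a list of real numbers is sorted, a nearest neighbor query reduces to a binary search over the sorted list.

    import bisect

    def nearest_neighbor_sorted(sorted_values, query):
        # Binary search for the insertion point, then compare the candidates
        # on either side of it; whichever is closer is the nearest neighbor.
        i = bisect.bisect_left(sorted_values, query)
        candidates = sorted_values[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda v: abs(v - query))

    training = sorted([3.1, 0.5, 2.7, 9.9, 4.2])
    print(nearest_neighbor_sorted(training, 4.0))  # -> 4.2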

I went on to demonstrate empirically that these necessarily true mathematical results work, in the “Massive” edition of my AutoML software, BlackTree AutoML. The results are literally a joke, with my software comically outperforming Neural Networks by an insurmountable margin: Neural Networks take over an hour to solve problems that BlackTree solves in less than one second (on a consumer device), with basically the same accuracy in general. Obviously, this is going to have a big impact on the world, but the real point is, what do the applications of something like the Graph Minor Theorem even look like? I have no idea. There’s another theorem in [2] regarding the maximization of an entropy-like function over vectors, and I have no idea what it means, but it’s true. I’ve dabbled with its applications, and it looks like some kind of thermodynamics thing, but I don’t know, and this is disturbing. Because again, if true, it implies that the bulk of human accomplishment has yet to occur, and it might not ever occur because our leaders are a bunch of maggots, but if we survive, then I think the vast majority of what’s possible is yet to come.

All of that said, I’m certainly not the first person to notice that mathematics often runs ahead of, e.g., physics, but I’m pretty sure I’m the first person to notice the connection (if it exists) between information and utility, at least in a somewhat formal manner. If this is real, then humanity has only scratched the surface of the applications of mathematics to reality itself, plainly beyond physics.