A Note on the Logarithm of Infinite Cardinals

I thought I had posted something about the logarithm of \aleph_0, which is something I think about at times, but I can’t seem to find it, so I’ll begin from scratch:

Lemma 1. The logarithm of \aleph_0 is not equal to \aleph_0.

Proof. Assume that \aleph_0 = \log(\aleph_0). If we assume that \aleph_1 = |\mathbb{R}| = 2^{\aleph_0} (the continuum hypothesis), then it follows that \aleph_1 = 2^{\log(\aleph_0)} = \aleph_0, since 2^{\log(\aleph_0)} = \aleph_0 by definition, which leads to a contradiction, because \aleph_1 \neq \aleph_0.

Definitions. As a consequence, I’ve defined a new number, \Omega = \log(\aleph_0), that I’ve used in apparently unpublished research to think about the number of bits you’d need to represent a countably infinite system. However, it just dawned on me that you can construct a system that has a number of bits equal to \Omega as follows:

Assume that a system S consists of a countably infinite number of rows, each indexed by a natural number, starting with 1. Now assume that each row contains a number of bits equal to the index of that row. So, for example, row 1 contains 1 bit, which can be in exactly 2^1 = 2 states, row 2 contains 2 bits, which can be in exactly 2^2 = 4 states, etc. However, assume S has a tape head, like a UTM, that can move both up and down rows, and left and right within a row, thereby reading or writing the bits in the row at its current index. So, for example, if the head is currently at row 2, it can move left or right to access the 2 bits in that row. Further, assume that if the head moves either up or down, then all of the bits in every row reset to zero. This means that once the tape head changes rows, all of the information in the entire system is effectively deleted.
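To make the construction concrete, a minimal Octave sketch of S might look like the following, with a finite number of rows standing in for the countably infinite case (the row count, function names, and representation here are illustrative assumptions, not part of the definition of S):

% Sketch of the system S: row i holds i bits, and moving the head to a
% different row resets every bit in every row to zero. The number of rows
% is truncated to a finite value purely for illustration.
1;

function S = make_S (num_rows)
  S.rows = arrayfun (@(i) zeros (1, i), 1:num_rows, "UniformOutput", false);
  S.head = 1;                      % the row the head currently occupies
endfunction

function S = write_bit (S, col, val)
  S.rows{S.head}(col) = val;       % bits can only be written within the current row
endfunction

function S = move_head (S, new_row)
  if (new_row != S.head)           % changing rows deletes all information in S
    S.rows = cellfun (@(r) zeros (size (r)), S.rows, "UniformOutput", false);
    S.head = new_row;
  endif
endfunction

S = make_S (5);
S = move_head (S, 3);              % head now on row 3, which holds 3 bits
S = write_bit (S, 2, 1);           % row 3 now reads [0 1 0]
S = move_head (S, 4);              % every bit in the system resets to zero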

The number of states that S can occupy is at least \aleph_0, since it can represent every natural number. However, S is distinct from a countably infinite binary string, since changing the row of the tape head causes the entire system to reset, which implies that only a finite number of bits, all within a single row, can ever be used in combination with one another. The number of bits in S is therefore unbounded, but not equal to \aleph_0, because they cannot be used in combination with one another, outside of a single row.

Lemma 2. The number of states of S cannot be greater than \aleph_0.

Proof. We can assign finite, mutually exclusive subsets of \mathbb{N} to each row i of S with a cardinality of at least 2^i (e.g., use mutually exclusive subsets of powers of primes). Therefore, the number of states of S cannot be greater than the number of elements in \mathbb{N}, which is equal to \aleph_0.
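For concreteness, one way to make this assignment explicit (an elaboration of the parenthetical above, using the i-th prime p_i) is to pair the 2^i states of row i with the set

\{p_i, p_i^2, \ldots, p_i^{2^i}\},

which is a finite subset of \mathbb{N} with exactly 2^i elements. These sets are mutually exclusive across rows, since a power of one prime is never equal to a power of a different prime.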

Corollary 3. The number of states of S is equal to \aleph_0.

Proof. Every natural number corresponds to at least one state of S, which implies that the number of states of S is at least \aleph_0. However, Lemma 2 implies that the number of states of S cannot be greater than \aleph_0, which implies that they are equal.

Assumption. The number of bits in a system is equal to the logarithm of the number of states of the system.

This is a standard assumption for finite systems, since it implies that a system with N states is equivalent to \log(N) bits. In this case, it implies that S is equivalent to \Omega = \log(\aleph_0) bits, since S has exactly \aleph_0 states, just like the set of natural numbers itself.
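As a finite illustration of this assumption, a system with N = 8 states corresponds to \log(8) = 3 bits, since three binary digits suffice to distinguish all 8 states; the claim about S is the analogous statement, with \aleph_0 in place of N.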

Lemma 4. \Omega + c = \Omega, for all c \in \mathbb{R}.

Proof. Assume \Omega + c = X. It follows that 2^X = 2^{\Omega + c} = 2^{\Omega} 2^{c} = \aleph_0 2^c = \aleph_0, which implies that X = \log(\aleph_0) = \Omega.

Corollary 5. \Omega \notin \mathbb{N}.

Proof. This follows immediately from Lemma 4, since no natural number n satisfies n + c = n for any nonzero c (e.g., n + 1 \neq n), whereas \Omega + c = \Omega for all c \in \mathbb{R}.

Note that \Omega - 1 = \Omega, which in turn implies that we can remove any single row from S, without reducing the number of bits in S, which is clearly the case.

Corollary 6. \Omega < \aleph_0.

Proof. We can assign finite, mutually exclusive subsets of \mathbb{N} to each row i of S with a cardinality of at least i (e.g., use mutually exclusive subsets of powers of primes). Therefore, the number of bits in S, which is \Omega, cannot be greater than the number of elements in \mathbb{N}, which is equal to \aleph_0. However, Lemma 1 implies that \Omega \neq \aleph_0, and therefore, \Omega < \aleph_0.

Corollary 7. \Omega > n, for all n \in \mathbb{N}.

Proof. For any given n \in \mathbb{N}, we can always find a row i of S such that i > n. Therefore, the number of bits in S is greater than n, for all n \in \mathbb{N}.

Corollary 8. There is no set A \subseteq \mathbb{N} with a cardinality of \Omega.

Proof. Assume such a set A exists. Because \Omega > n, for all n \in \mathbb{N}, |A| cannot be finite. Because |A| = \Omega < \aleph_0 = |\mathbb{N}| (by Corollary 6), A cannot be countably infinite. But every subset of \mathbb{N} is either finite or countably infinite, which completes the proof.

So to summarize, the number of bits in S is sufficient to represent all natural numbers, and the number of states of S is equal to the cardinality of the natural numbers. If we assume, as a general matter, that the number of bits in a system is equal to the logarithm of the number of states of the system, then the number of bits in S is \Omega = \log(\aleph_0).

Corollary 9. There is no number of bits X < \Omega, such that X > n, for all n \in \mathbb{N}.

Proof. Assume that such a number exists, and that for some system \bar{S}, the number of states of \bar{S} is 2^X. If 2^X = \aleph_0, then X = \Omega, contradicting the assumption that X < \Omega, so it must be the case that 2^X \neq \aleph_0. Because X < \Omega, the number of states of \bar{S} must be less than the number of states of S, and so 2^X < \aleph_0. This implies that we can assign each state of \bar{S} a unique natural number, which in turn implies the existence of a set of natural numbers A, such that |A| > n, for all n \in \mathbb{N}, and |A| < \aleph_0. Such a set is neither finite nor countably infinite, which contradicts the fact that every subset of \mathbb{N} is either finite or countably infinite (as noted in the proof of Corollary 8). Since the existence of \bar{S} implies the existence of A, which leads to a contradiction, no such \bar{S}, and therefore no such X, exists.

Corollary 10. There is no set A such that |A| = \Omega.

Proof. Assume such a set A exists. Since |A| = \Omega < \aleph_0, we can assign each element of A a unique natural number, which will in the aggregate generate a set B \subseteq \mathbb{N} with |B| = |A| = \Omega. The existence of B contradicts Corollary 8, which completes the proof.

This work implies that \Omega is the smallest non-finite number, and that it does not correspond to the cardinality of any set.

This in turn implies an elegant theory of number itself:

Begin with a system S_0 that is always in the same state s_0, and assume that the number of states of S_0 is 1, and that the number of bits in S_0 is 0 = \log(1). Now assume the existence of S_1, which is always in the same state s_1, and that this state is distinct from s_0. Further, assume the existence of some system \bar{S} that can be in either s_0 or s_1, and assume that the number of states of \bar{S} is 2, and that the number of bits in \bar{S} is 1 = \log(2). If we allow an unbounded number of instances of \bar{S} to exist in conjunction with one another, then basic counting principles imply the existence of all natural numbers, since a collection of systems, each equivalent to \bar{S}, is in turn equivalent to some binary string.
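As a small illustration of that counting step, the joint states of k copies of \bar{S} can be enumerated as binary strings of length k, and hence read off as the natural numbers 0 through 2^k - 1 (a toy Octave example, with k chosen arbitrarily):

% Enumerate the joint states of k two-state systems as binary strings.
% There are exactly 2^k of them, and reading each row as a binary numeral
% recovers a natural number between 0 and 2^k - 1.
k = 3;
states = dec2bin (0:(2^k - 1), k);   % "000", "001", ..., "111"
disp (states);
printf ("number of joint states: %d = 2^%d\n", rows (states), k);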

Note that in this view, the number 0 always has units of bits, and is never some number of states, just as \Omega must always have units of bits, since it does not correspond to the cardinality of any set, or to any number of states. This suggests the notion that some numbers must necessarily carry particular units, because they can only be derived from one class of mathematical object, which necessarily implies those units. For example, in this view, 0 must always carry units of bits, and can never carry units of states. In contrast, 1 can be either a number of bits or a number of states.

Interestingly, Shannon’s equation implies that the logarithm of the reciprocal of a probability, \log(\frac{1}{p}), has units of bits. The work above implies that the logarithm of some number of states, \log(N), also has units of bits. Together, this implies that a probability has units of the reciprocal of some number of states, p = \frac{1}{N}. This is perfectly consistent with a frequentist view of a probability, which would be expressed as some portion of some number of states. Again, all of this work suggests the possibility that numbers have intrinsic units, apart from any physical measurement.

Two Lemmas on Prediction and Complexity

Introduction

In a previous article, I introduced a definition of a random variable that says, in simple terms, that a source is random (and observing that source will in turn generate a random variable) if the source generates a sequence of observations that is eventually, approximately, Kolmogorov-random.

Note that my definition of a random variable is close enough to Per Martin-Löf’s, that it’s possible he, or someone else, already proved variations of these results –

I noticed these results in the context of my work on A.I. and physics, which are my primary areas of interest.

The Lemmas

For simplicity, let’s assume that observing the source generates truly Kolmogorov-random strings, since there is some nuance to the definition I offered in the previous article that adds only confusion, with no meaningful upside.

Recall that the Kolmogorov complexity of a binary string x, denoted K(x), is the length of the shortest input y to a UTM that will generate x, and that if x is Kolmogorov-random, then K(x) = |x| + c, for some constant c that does not depend upon x. For example, if c is the length of the “print” function, then because any string can be generated on a UTM by providing the string in question as an argument to the “print” function, it follows that for all strings x, K(x) \leq |x| + c.

Given a binary string x, let x_n denote the first n entries of x, and let U(x) denote the output of a UTM when given x as input.

Lemma 1. If a binary string x_n is Kolmogorov-random, for all n \in \mathbb{N}, then there is no computable function f that can generate the next n - k entries of x, given the first k entries of x, for all n,k \in \mathbb{N}.

Proof. Assume such a function f exists. It follows that, given the first k entries of x, namely, x_k = (x_1, x_2, \ldots, x_k), we can generate the next n - k entries (x_{k+1}, \ldots, x_n) on a UTM, using f, x_k, and a specification of n. Because we can specify any integer m with at most \log(m) bits, it follows that K(x_n) \leq |f| + k + \log(n). Because x_n is Kolmogorov-random, K(x_n) = n + c, which implies that n - \log(n) \leq |f| - c + k. Note that the difference |f| - c is constant, whereas for any fixed value of k, we can choose an arbitrarily large value of n, making the left-hand side n - \log(n) exceed any constant bound. There is, therefore, no such function f that exists for all n, k \in \mathbb{N}, which completes the proof by contradiction.

Lemma 2. If a string x_n is Kolmogorov-random, for all n \in \mathbb{N}, and there is some computable function f that can generate an unbounded number of entries of x, then f is subject to the inequality below.

Proof. Assume such a function f exists. It follows that f can generate some unbounded number of entries of x, namely S = \{x_{n_1}, x_{n_2}, \ldots\}, where n_i is the index of some entry within x, each listed in order. Let k \in \mathbb{N}. Because f can generate an unbounded number of entries of x, we can provide an argument z, such that U(f,z) will generate exactly k entries of x, in order. It follows that we can construct, on a UTM, a string that contains those k entries of x, in order, with any missing entries represented by some special character. Call the resultant string s, and assume that n = |s|. Note that K(z) \leq \log(k). This implies that K(s) \leq |f| + \log(k).

Note that we can generate x_n given s, the list of indexes that correspond to the missing entries within s, and a string generated by concatenating the missing entries of s that are within x_n into a single string, y. Because the maximum index of any entry within y is n, it follows that we can specify any such individual index with at most \log(n) bits. Since there are n - k entries in y, it follows that we can specify all of the indexes within y with at most (n - k) \log(n) bits.

This implies that,

K(x_n) \leq K(s) + K(y) + (n - k) \log(n) \leq [|f| + \log(k)] + (n - k) + (n - k) \log(n).

Because x_n is Kolmogorov-random, K(x_n) = n + c, and so this in turn implies that,

n + c \leq |f| + \log(k) + (n - k) + (n - k) \log(n),

and so,

c \leq |f| + \log(k) - k + (n - k) \log(n).

If such a function f exists, then this is a surprising result, since it implies that you can predict an unbounded number of non-sequential observations of a random variable, albeit subject to constraints. The question is, is there another reason why such a function cannot exist?

Vectorized Image Partitioning

I’ve implemented my vectorized rendition of my original image partition algorithm, and the results are excellent and remarkably efficient. I’ve yet to apply it to a formal classification problem, but the results seem to be just as good as those of my original algorithm, and the runtimes are orders of magnitude smaller. This implies that it should work just as well as my original algorithm as preprocessing for image classification problems, the only difference being that it’s much more efficient.

Here are some examples, which include the original image, the boundaries generated by the partition, and the average color of each partition region. The average colors of the partition regions would then be assembled into a matrix / vector that represents the image (effectively a collection of super pixels) and fed as the input to a classification algorithm.

Below each panel is the original image size, and the runtime of the partition algorithm, as applied to each image.

image size = 960 x 774 x 3
runtime = 0.61713 seconds

image size = 2048 x 1536 x 3
runtime = 3.1197 seconds

image size = 3722 x 1635 x 3
runtime = 7.2745 seconds

Octave Code:
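The listing below is only a minimal sketch of the super-pixel averaging step described above, not the partition algorithm itself; it assumes a precomputed label matrix L, produced by the partition, that assigns every pixel to a region:

% Average the color of each partition region ("super pixel"), given an
% RGB image and a label matrix L of the same height and width, where
% L(r, c) is the index of the region containing pixel (r, c).
% L is assumed to come from the partition algorithm; it is not computed here.
1;

function avg_colors = region_average_colors (img, L)
  img = double (img);
  num_regions = max (L(:));
  avg_colors = zeros (num_regions, 3);          % one RGB row per region
  for ch = 1:3
    channel = img(:, :, ch);
    % accumarray sums channel values per region; dividing by the region
    % sizes yields the mean color of each region.
    sums = accumarray (L(:), channel(:), [num_regions, 1]);
    counts = accumarray (L(:), 1, [num_regions, 1]);
    avg_colors(:, ch) = sums ./ counts;
  endfor
endfunction

% Example usage with a dummy image and a dummy 2-region labeling:
img = randi (255, 4, 6, 3);
L = [ones(4, 3), 2 * ones(4, 3)];               % left half region 1, right half region 2
disp (region_average_colors (img, L));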

Some Thoughts on Random Variables

I’ve been working on a new model of probability theory in my free time, and though I’m nowhere near done with it, I’ve come up with what I believe to be a fairly elegant definition of a random variable that I think captures what it is that I’m trying to describe, which is ultimately rooted in computer theory and information theory:

The basic idea is that if a source is truly random, then prior observations should have no impact on future observations. We can express this rigorously by saying that a source S generates signals over the alphabet \Sigma, and that for any N observations, regardless of any prior observations, the set of possible sequences of observations is given by the full set of |\Sigma|^N strings of length N over \Sigma. What this says is, the set of possible outcomes never changes, regardless of how many observations you’ve already made. One consequence of this is that you always have the same uncertainty with respect to your next observation, which could produce any one of the signals in \Sigma.
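For example, with \Sigma = \{H, T\} and N = 3, the set of possible sequences for the next three observations is always the full set of 2^3 = 8 strings, from HHH through TTT, no matter what has already been observed.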

However, if the source is truly random, then it should produce a roughly Kolmogorov random string, eventually. If that’s not the case, then there will always be some computable process that can generate the observations in question. For example, the digits of a computable real number like \pi or e might seem superficially random, but they’re not, and are instead entirely determined ex ante by a computable process. If a source is truly random, then intuition suggests that eventually, it will evade modeling by a UTM, which implies that with a sufficiently large number of observations, the string of observations generated by the source should approach the complexity of a Kolmogorov random string.

We can express this rigorously by saying that for every real number \delta > 0, and every number of observations n, there exists a number of observations N > n, for which,

1 - \frac{K(x_N)}{|x_N|}  < \delta,

where x_N is the string generated by making N observations of the source. What this says is, we can always make a number of observations that will bring the resultant string of observations arbitrarily close to a Kolmogorov random string, but this does not require actual convergence in the limit. Note that this definition does not require convergence to any particular distribution either; some physical systems could, for example, simply change distributions as a function of time, or never have a stable distribution of states at all.
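Since K is not computable, this criterion can only be probed indirectly. As a rough Octave sketch, one can use compressed size as a crude upper-bound stand-in for K (purely a heuristic, and an assumption not implied by the definition above):

% Crude illustration of the ratio K(x_N) / |x_N|, using gzip'd size as a
% stand-in for K, which is uncomputable; compressed size is only an upper
% bound on complexity, so this is a heuristic, not a measurement of K.
1;

function ratio = compression_ratio (bytes)
  raw_file = tempname ();
  fid = fopen (raw_file, "wb");
  fwrite (fid, bytes, "uint8");
  fclose (fid);
  gz_list = gzip (raw_file);             % compress, and capture the output file name
  gz_file = gz_list{1};
  info_raw = dir (raw_file);
  info_gz = dir (gz_file);
  ratio = info_gz.bytes / info_raw.bytes;
  delete (raw_file);
  delete (gz_file);
endfunction

N = 2^20;
random_bytes = randi ([0, 255], 1, N);   % pseudo-random, so strictly speaking computable; illustrative only
constant_bytes = zeros (1, N);           % a highly compressible, non-random source
printf ("random source ratio:   %.3f\n", compression_ratio (random_bytes));
printf ("constant source ratio: %.3f\n", compression_ratio (constant_bytes));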

On Complexity and Control

One of the first useful conclusions I came to in applying information theory to physics was that you could use Shannon entropy to measure the complexity of the motion of an object (see Section 1, generally). For example, light always travels in a perfectly straight line in a vacuum, and so it has a zero entropy distribution of motion. In contrast, a baseball has some wobble when observed at a sufficiently close scale, and therefore the entropy of its distribution of motion is greater than zero. Said otherwise, most objects exhibit some mix of velocities, even if what you see appears to be rectilinear at our scale of observation. To cut through any noise in observation, you could cluster the observed velocities, which will group similar velocities together, and then measure the entropy of the distribution of the resultant clusters, which is something most of my clustering algorithms do anyway.
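As a minimal sketch of that last step, using simple fixed-width binning as a stand-in for a clustering algorithm, and made-up velocity data rather than real observations:

% Entropy of a distribution of observed velocities, after grouping similar
% velocities together. Fixed-width binning is used here as a simple
% stand-in for a clustering algorithm, and the velocities are made up.
1;

function H = cluster_entropy (velocities, bin_width)
  labels = round (velocities / bin_width);        % similar velocities share a label
  [~, ~, idx] = unique (labels);
  counts = accumarray (idx(:), 1);
  p = counts / sum (counts);
  H = -sum (p .* log2 (p));                       % Shannon entropy, in bits
endfunction

straight = ones (1, 1000);                        % constant velocity: zero entropy
wobbly = 1 + 0.05 * randn (1, 1000);              % small wobble around the same velocity
printf ("entropy (straight): %.4f bits\n", cluster_entropy (straight, 0.01));
printf ("entropy (wobbly):   %.4f bits\n", cluster_entropy (wobbly, 0.01));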

In the case of a system subject to mechanical control, you could of course have macroscopic wobble as a result of some adjustment, which is generally undesirable. Instead, what you generally want is a smooth rate of change, because even if the system is not a vehicle or vessel for humans or other living things, jumpy, uneven motions are aesthetically unappealing, and could even give the impression of incompetence or malfunction.

We can quantify this, again using entropy. For example, imagine a platform that is supported from below by four columns, and we want to adjust its tilt. If you can only deliver one instruction at a time, then its motion will probably be wobbly. For example, let’s say we want to drop the two columns on the right to some lower, equal height, causing the platform to slope downward from left to right. Assuming all columns start at equal heights, we will need to deliver an equal number of decline instructions to the two columns on the right to cause them to drop to equal heights. This will result in a sequence of signals comprised of exactly two distinct instructions, equal in number, which will have an entropy of \log(2). And again, this will cause the platform to wobble, because the front and back columns on the right will generally be at different heights while the drop is in progress. Now assume instead that we deliver simultaneous instructions to both the front and back columns on the right. This will produce a uniform sequence of signals, consisting of exactly one instruction, delivered repeatedly. Therefore, this sequence of signals has a zero entropy distribution.
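A short Octave sketch of this comparison, with made-up instruction labels, purely for illustration:

% Entropy of the instruction stream for the two schedules described above.
% Instruction labels are made up: 1 = "drop front-right", 2 = "drop back-right",
% 3 = "drop both right columns simultaneously".
1;

function H = signal_entropy (signals)
  counts = accumarray (signals(:), 1);
  p = counts(counts > 0) / numel (signals);
  H = -sum (p .* log2 (p));
endfunction

steps = 100;                                    % number of height decrements per column
one_at_a_time = repmat ([1, 2], 1, steps);      % alternate the two single-column drops
simultaneous = repmat (3, 1, steps);            % one combined instruction, repeated
printf ("one at a time: %.4f bits\n", signal_entropy (one_at_a_time));   % log2(2) = 1
printf ("simultaneous:  %.4f bits\n", signal_entropy (simultaneous));    % 0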

What this simple example highlights is that we can use entropy to measure the complexity of the distribution of instructions delivered to some mechanism, which could allow us in turn to measure the complexity of the resultant behavior. We could go further, and attempt ex ante to minimize the entropy of the distribution of instructions over some collection of sets of instructions that all achieve the same end goal state. This minimum entropy sequence of instructions would be in this view, the simplest way to get to the goal state.

The Surprisal of a Distribution

The probability of throwing a billion heads in a row using a fair coin is no different than any other outcome involving a billion coin tosses. If we measure the surprisal of observing a billion heads in a row using Shannon information, it’s the same as every other outcome, since their probabilities are all equal, forming a uniform distribution. At the same time, observing a billion heads in a row is beyond surprising, as a practical matter, and would probably be a story you tell for quite a while (putting the time constraints aside).

If, however, we look at the distribution of outcomes across the throws, then we get a very different measure of surprisal. Specifically, there is only one sequence that consists of only heads, and so the probability of that distribution is extremely low, in context, since there are an enormous number of possible distributions over a billion coin tosses. We can abstract further, and look at the structure of the distribution, which in this case consists entirely of a single outcome. There are exactly two distributions with this structure –

All heads and all tails.

We can abstract one more time, looking at the distribution of entropies. And again, there are exactly two zero-entropy distributions –

All heads and all tails.

The surprisal of observing an enormous, uniform sequence of tosses, measured using any of these higher-level probabilities, will be significant, since the probability in question will be very low, in context.
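To put a number on this, the sketch below computes the surprisal of the observed heads count (one way of formalizing the distribution of outcomes across the throws), using the binomial probability of exactly k heads in n fair tosses; the use of gammaln is just to avoid overflow for large n:

% Surprisal, in bits, of observing exactly k heads in n fair coin tosses,
% i.e., -log2( nchoosek(n, k) / 2^n ), computed with gammaln to avoid
% overflow for very large n.
1;

function s = count_surprisal (n, k)
  log2_binom = (gammaln (n + 1) - gammaln (k + 1) - gammaln (n - k + 1)) / log (2);
  s = n - log2_binom;
endfunction

n = 1e9;
printf ("all heads:        %.2f bits\n", count_surprisal (n, n));       % n bits: astronomically surprising
printf ("half heads:       %.2f bits\n", count_surprisal (n, n / 2));   % only ~15 bits
printf ("single sequence:  %.2f bits\n", n);                            % every individual sequence carries n bits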

This intuition is backed up by common sense –

The uniform string of a billion heads or tails is surprising because it has an objective property that is special, and rare, even though all other underlying strings are generated with the same frequency. You don’t notice the other strings, in real life, because they’re not as special in this view, even though they all carry the same underlying probability.