Fully Vectorized Nearest Neighbor

I’ve reduced the nearest neighbor method to 10 lines of code in Octave / Matlab:

function nearest_neighbors = NN_fully_vectorized(dataset, N)

dataset = dataset(:,1:N); % removes the hidden classifier column
num_rows = size(dataset,1);
num_cols = size(dataset,2);
temp_matrix = repmat(dataset', [1 1 num_rows]);
ref_dataset = shiftdim(temp_matrix,2); % page i contains num_rows copies of row i of the dataset
diff_matrix = sum((dataset - ref_dataset).^2,2); % squared distance from every row to every other row
zero_indices = 1:num_rows+1:num_rows*num_rows;
diff_matrix(zero_indices) = Inf; % sets the self-distance entries to infinity
[~, b] = min(diff_matrix);
nearest_neighbors = reshape(b,[num_rows 1]);

endfunction
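
To make the interface concrete, here’s a hypothetical usage example (the dataset, its size, and the class labels are made up for illustration), assuming the hidden classifier sits in the last column:

% hypothetical example: 500 rows, N = 3 feature columns, plus a hidden classifier column
num_rows = 500;
N = 3;
features = rand(num_rows, N);
classes = randi(2, num_rows, 1); % arbitrary class labels
dataset = [features classes];

nearest_neighbors = NN_fully_vectorized(dataset, N);
% nearest_neighbors(i) is the row index of the row closest to row i,
% measured by squared Euclidean distance over the first N columns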

A Thought on Epistemology

Geometry, like combinatorics, is absolute –

The problem is human error.

If you were able to draw a perfect triangle, and measure it perfectly, you would find that the Pythagorean theorem holds, perfectly: if you assume the universe is at least countably infinite, then a triangle drawn with countable precision would produce no real number error.

Empiricism suggests the same –

The more precisely we draw, and the more precisely we measure, the more consistent observation is with mathematics. It also suggests the possibility that physics is absolute, and simply unknown to us.

There is combinatorics, which is plainly absolute, with no possible deviations, since it requires only finite discernment, and a finite number of observations –

Simply counting.

In contrast, Geometry requires an infinite number of points, and is therefore not observably absolute in the way combinatorics is. But because it requires only finite length proofs, it is absolute as an idea, in that any knowable claim about a geometric object can be proven in a finite number of steps.

Physics requires infinite observation to know with certainty, and therefore, it is not possible to know without infinite time, whether a given model of physics is truly absolute.

This creates a hierarchy of knowledge:

  1. Combinatorics, known to be absolute.
  2. Geometry, known to be absolute as an idea.
  3. Physics, cannot be known as absolute in finite time.

However, if the Universe is for example countably infinite, and if your perspective were infinitesimal, able to make infinitesimal measurements, then you would see the imperfections in what appear to be perfect curves, since you would see the local linearity of a curve that is perfect from our perspective, which is limited to computable measurement.

Information, Knowledge, and Uncertainty

In a previous article, I introduced a rigorous method for thinking about partial information, though for whatever reason, I left out a rather obvious corollary, which is that you can quantify both uncertainty and knowledge in terms of information, though I did go on a rant on Twitter, where I explained the concept in some detail.

Though I’ve done a ton of work on the issue, which I’ll expand upon in other articles, the fundamental equation that follows is,

I = K + U,

where I is the total information of the system, K is the subjective knowledge with respect to the system, held by some observer, and U is the uncertainty of that same observer with respect to the system, all having units of bits.

To develop an intuition for the equation, consider a set of N labeled boxes, one of which contains a pebble. The system of the N boxes and single pebble can be in any one of N states, since the pebble can be in any one of the boxes, with each such configuration representing a unique state of the system. As a storage device, the system can be in any one of N states, and can, therefore, store \log(N) bits of information.

I am plainly making the assumption that this is the correct measure of the information content of the system as described. What cannot be debated is the fact that the system, as described, can be used to store \log(N) bits. Moreover, as a general matter, the storage capacity of any system is always given by the logarithm of the number of states of the system, and so, though there could be other useful measures of the information content of a physical system, the log of the number of states of the system is both objectively, and physically meaningful.

Returning to the example of the pebble in one of N boxes, imagine you have no information at all about which box contains the pebble, and no knowledge of the probabilities that the pebble is in a given box. In this case, your knowledge is exactly zero, and so your uncertainty is maximized at \log(N). If we increase N, we increase your uncertainty, which as a practical matter, is exactly what would happen, since you now have more possible outcomes. Stated more formally, the system can be in a greater number of states, which as a mathematical and practical matter, increases your uncertainty with respect to the system.

You can read through the previous article to understand the mathematics of how knowledge changes as a function of observation and receipt of information, but the intuition is straightforward: learning about a system eliminates possible states of the system. For example, if I tell you that the pebble is in the first box, then it isn’t in the rest of them, and your uncertainty drops to zero. If I tell you that the pebble isn’t in the first box, then your uncertainty decreases, but is still greater than zero.

Now imagine there are M pebbles, and N boxes, for some M < N. It follows that this system can be in {N \choose M} states. Intuitively, your uncertainty should follow the arc of the binomial coefficient, as a function of M, increasing for some time, and then eventually decreasing again past M = \frac{N}{2}. Again, this causes the measure of uncertainty introduced above to change in a manner that is consistent with intuition, in that the more states a system can be in, the greater your uncertainty is with respect to the system.
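As a quick numerical sketch (the values of N and M below are assumptions, chosen just for illustration), the uncertainty \log{N \choose M} does in fact trace that arc:

% sketch: uncertainty (in bits) for M pebbles in N = 10 boxes, absent any other information
N = 10;
U = zeros(1, N - 1);
for M = 1 : N - 1
  U(M) = log2(nchoosek(N, M)); % log of the number of possible states of the system
end
disp(U); % increases up to M = N/2, then decreases again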

In Shannon’s original paper, in addition to introducing entropy as a measure of minimum average encoding length, he also proved that the equation satisfies a sensible definition of uncertainty. If you apply this idea to physical systems, what you end up with is an equivalence between the minimum average code length required to represent the system, and the uncertainty of an observer with respect to the system.

Again, these are assumptions that make use of underlying mathematical theorems, and are not theorems in and of themselves. Said otherwise, you could argue that physical uncertainty is not appropriately measured by the Shannon entropy, or the ideas I’ve introduced above, and in the previous article, which is fine. My goal is to present a practical measure of uncertainty, that moves in the correct direction, as a function of observation, and receipt of information, understanding that it may not even be possible to prove a physical claim of this type in any decisive manner.

It follows that if N is the number of states of the system, then simply substituting uncertainty with Shannon’s equation for entropy, we have,

I = K + H,

where H = \sum_{i = 1}^{N}p_i\log(\frac{1}{p_i}).

So, as a matter of simple algebra, we have that,

K = I - H.

This is consistent with intuition, in that as the entropy of a probability distribution decreases, you have more knowledge about the behavior of the system ex ante. In the extreme case, where the system is almost certain to be in a particular state, you have almost perfect knowledge about its behavior, since you can reliably expect it to be in that particular state.

Returning to the example above, let’s assume that the pebble is almost certain to be in the first box. This implies that the entropy of the probability distribution of the location of the pebble will be approximately zero, which in turn implies almost perfect knowledge about the system. This is consistent with intuition, since you can reasonably expect the pebble to be in the first box, basically all the time.
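
Here is a minimal numerical sketch of the equation, using an assumed near-certain distribution over N = 8 boxes (the probabilities are made up for illustration):

% sketch: I = K + U, with U given by the Shannon entropy of the distribution
N = 8;
I = log2(N); % total information of the system, in bits

p = [0.99, repmat(0.01 / (N - 1), 1, N - 1)]; % pebble almost certain to be in the first box
H = sum(p .* log2(1 ./ p)); % entropy is close to zero
K = I - H; % knowledge is therefore close to log2(N), i.e., nearly perfect
printf("I = %.4f, H = %.4f, K = %.4f\n", I, H, K);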

Distributed Linear Interpolation

In a previous article, I introduced a vectorized polynomial interpolation algorithm that is extremely efficient, generating reasonably accurate interpolations of functions in about one to three seconds. You can of course refine the results, which I discussed as well, but the point being, that by using vectorization, we can turn what are otherwise brute force techniques, into elegant and efficient solutions to problems in computing.

I realized that you could take a similar approach to problems that are by their nature distributed, making use of a large number of linear approximations. For example, imagine a gas expanding in a volume, and assume that you want to predict the density in the regions over time. You could subdivide the volume into equally sized cubes, and then use a series of linear equations to predict the densities in each cube. Specifically, if we have N cubes, then we would need N linear equations, with each equation f_i(t) = t\theta_i corresponding to a cube in the volume, approximating its density over time.

This can be done using almost exactly the same code I introduced in the previous article. However, if N > 10, its implementation becomes problematic, simply because there are too many permutations to consider, unless you’re using a truly massively parallel machine, in which case it should be fine. But as a workaround, you can use a Monte Carlo method instead, rather than iterate through all possible permutations of each \theta_i. That is, simply generate a massive number of random coefficients, which is a vectorized operation in Octave, rather than all possible coefficients at a given level of granularity. For context, it takes 0.018 seconds to generate 1.5 million random numbers in Octave, running on an iMac, which is the only initial step necessary to make use of the code I introduced in the previous article.
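
Here is a minimal sketch of that Monte Carlo step; the cube count, candidate count, observation times, and observed densities below are all assumptions made for illustration:

% sketch: Monte Carlo search for the slope theta_i of each cube's equation f_i(t) = t * theta_i
num_cubes = 1000;
num_candidates = 1500; % candidate slopes
t = (1:50)'; % observation times
observed = t * rand(1, num_cubes); % hypothetical observed densities, one column per cube

thetas = 2 * rand(num_candidates, 1) - 1; % vectorized generation of random slopes in [-1,1]
predictions = t * thetas'; % every candidate linear approximation, evaluated at every time

best_theta = zeros(1, num_cubes);
for i = 1 : num_cubes
  errors = sum((predictions - observed(:, i)).^2, 1); % total error of each candidate for cube i
  [~, idx] = min(errors);
  best_theta(i) = thetas(idx);
end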

You could of course repeat this process, adding a constant term, or perhaps higher degree terms, creating quadratics, cubics, and other higher degree approximations, if necessary, for accuracy. Simply fix the coefficients obtained from the first application, and repeat the same process, generating a large number of possible coefficients for the higher or lower degree terms.

An additional upside to this approach is the fact that evaluating a polynomial is also vectorized in Octave, an example of which is included in the code I introduced in the previous article. This implies that evaluating the behavior of a system like an expanding gas, or a moving fluid, can be implemented in a distributed manner, which is not only more efficient, but realistic, in that the parts of a system that aren’t physically proximate, are probably independent, at least in the short run.
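
Continuing the sketch above (and reusing its variables), a second pass that fixes a cube’s best slope and searches over a hypothetical quadratic term could look like this:

% sketch: refine cube i's approximation by fixing its slope and adding a quadratic term
i = 1;
quad_candidates = 2 * rand(num_candidates, 1) - 1; % random quadratic coefficients in [-1,1]
predictions = t * best_theta(i) + (t.^2) * quad_candidates'; % f_i(t) = theta_i*t + c*t^2
errors = sum((predictions - observed(:, i)).^2, 1);
[~, idx] = min(errors);
best_quad = quad_candidates(idx);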

A Note on Entropy and Time

I just came to a very strange realization:

The amount of time you measure, depends upon your scale of observation.

To understand why this is the case, imagine you spill ink onto a single sheet of paper. Now divide the paper into a grid of equally sized boxes, and count the number of ink molecules in each box over time as the ink stain spreads.

Now imagine the ink stain is your clock.

This is realistic, in that it’ll take time for the stain to spread, and so you could measure time by measuring the distribution of the ink among the boxes over time. Said otherwise, as the ink spreads, the amount of time that passes increases. But we can formalize this by measuring the entropy of the ink stain in the grid. Specifically, the ink stain begins in exactly one box, producing an entropy of zero, since it is the entropy of a distribution over a single event, with a probability of one.

Eventually, the ink should be roughly uniformly distributed among the boxes, producing the maximum possible entropy. As a result, the entropy of the distribution of the ink stain is a measure of time. However, what’s truly bizarre about this, is that the amount of time you measure in this model depends upon the dimensions of the grid.
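
A minimal sketch of that entropy clock, using an assumed grid of 100 boxes:

% sketch: the entropy "clock" for ink spread over a grid of num_boxes boxes
num_boxes = 100;

p_start = zeros(1, num_boxes);
p_start(1) = 1; % all of the ink begins in a single box
H_start = sum(p_start(p_start > 0) .* log2(1 ./ p_start(p_start > 0))); % 0 bits: no time has passed

p_end = ones(1, num_boxes) / num_boxes; % ink roughly uniform among the boxes
H_end = sum(p_end .* log2(1 ./ p_end)); % log2(num_boxes) bits: the maximum reading of the clock

printf("H_start = %.2f bits, H_end = %.2f bits\n", H_start, H_end);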

Just imagine a grid with one box –

That’s a clock that never moves, since the ink is always on the page, and therefore, the entropy is always zero. This is actually profound, and shows that if you want to truly distinguish between moments in time, you need a system that can be in a large number of states. For example, a normal clock repeats every 24 hours, but it’s supplemented by a day, month, and year, to distinguish it from other positions in time. More formally, a calendar is a system that has an infinite number of states. And you can’t truly tell time without such a system, otherwise you’ll end up labelling two different moments with the same name.

That said, calendars are only conceptually infinite, but are practically finite, because human beings live finite lives, and even the human species itself has been around for only some finite amount of time. It follows that unless the Universe itself is capable of being in an infinite number of states, it will eventually repeat the same state. Said otherwise, the state of the Universe as a whole is a clock, and if it’s only capable of being in some finite number of states, then by definition, eventually, if time itself is infinite, the same exact state of the Universe must happen twice.

Vectorized Interpolation

I’ve written an algorithm that vectorizes polynomial interpolation, producing extremely fast approximations of functions. The algorithm works by generating all possible polynomials of a given degree, and then evaluates all of them, but because these operations are vectorized, the runtime is excellent, even on a home computer.

As a practical matter, the runtime is of course sensitive to the algorithm’s parameters, which are (1) the number of possible values for each coefficient, and (2) the degree of the polynomial.

The polynomial is expressed as follows:

T(x) = \theta_0 + \theta_1 x^1 + \ldots + \theta_k x^k.

The algorithm works by quantizing the possible values of each \theta_i, which have an initial range of [-1,1]. By subdividing this interval into equally sized increments, we create a set of N = \frac{2}{increment} possible values for each \theta_i. We can then generate all possible resultant polynomials of a given degree k, by generating all possible ordered selections of k elements from a set of N elements. That is, each of the k coefficients \theta_i can take on any one of N possible values, resulting in N^k polynomials.

This algorithm then evaluates every resultant polynomial at every point in the domain, and selects the polynomial that produces the least total error, when compared to the range of the function.

Ordinarily, this would be an incredibly inefficient approach, but because all of these steps can be vectorized, the runtime is excellent, and on a sufficiently parallel machine, each step could be fully vectorized, likely allowing for basically instantaneous testing of every possible polynomial.
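
To make the procedure concrete, here is a minimal sketch of the brute-force search, not the original implementation, and with assumed parameters (degree 2, coefficients quantized in increments of 0.1, and sin as the target function):

% sketch: brute-force vectorized interpolation of sin(x) with a degree-2 polynomial
increment = 0.1;
values = -1 : increment : 1; % quantized coefficient values in [-1,1]
N = numel(values);

x = linspace(0, 2*pi, 100); % domain
y = sin(x); % range of the target function

% generate every possible (theta_0, theta_1, theta_2) combination
[T0, T1, T2] = ndgrid(values, values, values);
coeffs = [T0(:) T1(:) T2(:)]; % one candidate polynomial per row

% evaluate every candidate polynomial at every point in the domain
predictions = coeffs * [ones(size(x)); x; x.^2];

% select the candidate that produces the least total error
errors = sum((predictions - y).^2, 2);
[~, idx] = min(errors);
best_coeffs = coeffs(idx, :);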

Running on an iMac, the algorithm generates and tests 160,000 fourth-degree polynomials in 2.38 seconds, and the best curve is shown below in green, together with the original sin curve in blue.

The algorithm generates and tests 160,000 linear approximations in 1.06 seconds, with the results shown below.

Note that increasing the degree of the polynomial of course increases the number of coefficients, and therefore, the number of possible polynomials. As a result, higher degree polynomials effectively include all possible lower degree polynomials, since any coefficient could be approximately or exactly zero. For example, using a fourth-degree polynomial could produce a roughly straight line. Therefore, the degree of the polynomial will of course impact the runtime on an ordinary machine. But nonetheless, even ordinary machines are clearly capable of vectorized computing, making this approach useful for essentially every modern device.

However, this is just a first step, and the real point of this algorithm is that it can be used as an operator in a more elaborate optimization algorithm. For example, rather than simply terminate after this one step, you could instead analyze the results, and repeat the process on some subset of the polynomials generated and tested. One obvious additional step would be to include linear terms that adjust the position of the curve along the x-axis:

T(x) = \theta_0 + \theta_1 (x - \alpha_1)^1 + \ldots + \theta_k (x - \alpha_k)^k.

We could apply the exact same process to the linear terms \alpha_i, and select the set that minimizes the resultant error.

As a general matter, you could take the view that this approach is similar to a Monte Carlo approach, but I would disagree, since this process is deterministic, and covers a complete range of values at a given level of granularity, and is therefore not random at all. It is in my opinion the same approach I take to A.I. generally: select a scale of granularity in observation or discernment, and then use vectorization to the maximum extent possible, to solve the problem at that scale. Said otherwise, first compress the problem to a tractable space, and then use vectorization to solve the problem in that space.

All that said, this approach can be easily adapted to a Monte Carlo method, by simply generating random values of each \theta_i, rather than generating all possible values over some quantized space. The advantage of a Monte Carlo method would likely be even greater speed, since you don’t have to make use of the operators that generate all combinations and permutations, and instead simply use vectorized random number generation.
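
Adapting the sketch above to a Monte Carlo method would only change the coefficient generation step, for example:

% sketch: Monte Carlo variant - replace the exhaustive grid of coefficients with random ones
num_candidates = 160000;
coeffs = 2 * rand(num_candidates, 3) - 1; % random (theta_0, theta_1, theta_2) values in [-1,1]
% the evaluation and selection steps are identical to the brute-force sketch above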

Obviously, you could also use this for classifications, by applying T(x) to a classification function, and if necessary, adding additional weights to the classifier values, since they are generally arbitrary.

Octave Code:

A Note on Contingent Acceleration

I just saw an airplane turn on my walk home, and it dawned on me how strange it is that a machine can change its own velocity, which you could argue is a consequence of how I think about time. Specifically, I think about time as a set of possible outcomes along a discrete graph, simply because it’s a practical way to think about how a system can change, especially if you’re writing software to model its behavior. That is, if the system is in a given state, then there is some set of possible next states, and so on, which will produce a discrete graph. I happen to think that time is actually physically structured this way, but what’s relevant for this discussion, is that you can certainly model time as a discrete graph, as a practical matter.
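
As a toy illustration of that modeling approach (the states and transitions below are entirely hypothetical), the discrete graph can be represented with something as simple as a cell array of possible next states:

% sketch: time as a discrete graph of possible next states, for a toy "airplane" system
states = {"cruising", "turning left", "turning right", "climbing", "descending"};

% next_states{i} holds the indices of the states reachable from state i in the next moment
next_states = cell(1, numel(states));
next_states{1} = [1 2 3 4 5]; % a cruising airplane can do any of these next
next_states{2} = [1 2]; % mid-turn, it can keep turning or level out
next_states{3} = [1 3];
next_states{4} = [1 4];
next_states{5} = [1 5];

% an ordinary projectile, by contrast, has exactly one possible next state at each moment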

Returning to the airplane, when it’s heading straight ahead, there will be some set of possible next states for the airplane’s motion at any given moment. But if the airplane were an ordinary projectile, this would not be the case, macroscopically, and instead, it would have a simple parabolic trajectory that would be unchangeable, absent intervention. So an ordinary projectile moving through space near Earth, doesn’t really have an interesting path through time, but is instead deterministic, at least at our scale of observation, absent intervention. In contrast, because an airplane can turn, increase velocity, or perhaps tilt, at any given moment, the next moment in time could always be some legitimately different state, leading to a multiplicity of possible paths through time, representing the future outcomes that are possible at any given moment.

As an example, let’s assume the plane is flying itself, which is realistic, and during the flight, the plane responds to some weather data, that causes it to change its course, turning the plane in a particular direction. Tracing the events that lead to this outcome, we have an exogenous event, which is the weather itself, then we have the systems within the airplane, receiving a signal representing the change in the weather, and then a response in the software piloting the plane, and ultimately the physical mechanisms of the plane itself, within the engines and the wings, responding to the software’s instructions.

It’s not as if any of this violates the conservation of momentum, since the plane is powered by fuel and electricity, about which there’s no magic, though the plane’s role in this set of facts requires electrical and mechanical energy to already be extant within the plane, in order for it to first receive the transmission, calculate the appropriate response, and then physically execute that response. However, what is in my opinion fascinating, upon reflection, is that the ability of systems of this type to respond to exogenous signals allows these systems to have complex paths through time, that are contingent upon information that is exogenous to the system. In this case, even when piloting itself, the airplane can effectively decide to change directions, meaning that at any given moment, its path through time is very different from that of any ordinary, inanimate projectile.

All systems can of course be accelerated by exogenous momentum, which is typically a collision. But this is different: this is instead the case of a system that is ultimately accelerated by an exogenous piece of information. It is ultimately the ability to respond to information about an environment that makes the path of a system like this through time fundamentally different from any inanimate system. Though airplanes are obviously distinguishable from living systems, what I suppose I never fully appreciated, is that they arguably have more in common, in terms of their behavior through time, with living systems, than inanimate systems, since they are capable of acceleration that is contingent upon exogenous information.

To make it perfectly clear, we can distinguish between, for example, the position of the wheel of a car, and the information transmitted to the airplane: the position of a wheel is a physical state of the car itself, whereas the transmission regarding the weather is a representation of the state of an exogenous system, that is transmitted to the plane, that in turn causes a response within the plane itself. The point being, that a plane can respond to a representation of the state of an exogenous system, which really is the ability to autonomously respond to information, as opposed to being accelerated by some circumstance physically connected to the plane.