In a previous article, I introduced an algorithm that can cluster a few hundred thousand N-dimensional vectors in a minute or two, depending upon the dataset, by first compressing the data down to a single dimension. The impetus for that algorithm was thermodynamics, specifically clustering data that expands about a point, e.g., a gas expanding in a volume. That algorithm doesn't work for all datasets, but it is useful in thermodynamics, and probably in object tracking as well, since it lets you easily identify the perimeter of a set of points.
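To make the one-dimensional compression idea concrete, here is a minimal sketch in Python (the article's own code is in Octave, and this is not that code). It assumes the 1-D compression is distance from the centroid, which is my stand-in for the article's method, and it cuts the sorted distances at the largest gap to form clusters; the two-shell dataset is synthetic:

```python
import numpy as np

def cluster_by_radius(X, num_clusters=2):
    """Cluster by compressing each N-dimensional vector to a single
    number (its distance from the centroid), then cutting the sorted
    1-D values at the largest gaps."""
    center = X.mean(axis=0)
    radii = np.linalg.norm(X - center, axis=1)   # N-D -> 1-D compression
    order = np.argsort(radii)
    gaps = np.diff(radii[order])                 # gaps between consecutive sorted radii
    # cut at the (num_clusters - 1) largest gaps
    cuts = np.sort(np.argsort(gaps)[-(num_clusters - 1):])
    labels = np.empty(len(X), dtype=int)
    for cluster_id, idx in enumerate(np.split(order, cuts + 1)):
        labels[idx] = cluster_id
    return labels

# two concentric "shells" of points expanding about the origin,
# loosely in the spirit of a gas expanding in a volume
rng = np.random.default_rng(0)
dirs = rng.normal(size=(100, 10))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
X = np.vstack([1.0 * dirs[:50], 5.0 * dirs[50:]])   # radii ~1 vs ~5
labels = cluster_by_radius(X, num_clusters=2)
```

Because the whole comparison happens on a single sorted list of numbers, the work after compression is essentially a sort, which is what makes this style of clustering fast on large datasets.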
Below is a full-blown clustering algorithm that, unlike the one above, is general-purpose, yet can still handle enormous datasets efficiently. Attached is a simple classification example consisting of two classes of 10,000 ten-dimensional vectors each, for a total of 20,000 vectors.
The classification task takes about 14 seconds on an iMac, with 100% accuracy.
I've also tested it on two UCI datasets (the "Wine" and "Iris" datasets), and the accuracy was comparable, again around 100%, with equally impressive efficiency.
In addition to clustering the data, the classification algorithm generates a compressed representation of the dataset, which in turn allows the use of the nearest-neighbor method: an incredibly efficient prediction method that, in many real-world cases, is mathematically impossible to beat in terms of accuracy.
That said, even though nearest-neighbor is extremely efficient, on a dataset of this size it could easily start to get slow, since you are still comparing each input vector to the entire dataset. This method of clustering lets you use nearest-neighbor on enormous datasets, because the classification process generates a compressed representation of the entire dataset, and each prediction need only be compared against that compressed set.
In the specific case attached below, the dataset consists of 20,000 vectors, while the compressed dataset fed to the nearest-neighbor algorithm consists of just 4 vectors.
Predictions ran at a rate of about 8,000 per second, with no errors at all over all 20,000 vectors.
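The prediction step can be sketched as follows, again in Python rather than the article's Octave. The key point is that nearest-neighbor runs against the small set of representative vectors, not the 20,000-point dataset. How the compressed vectors are actually produced is not specified here, so this sketch uses one centroid per class as a hypothetical stand-in, on a synthetic two-class dataset of the same shape as the attached example:

```python
import numpy as np

def nn_predict(X, reps, rep_labels):
    """Nearest-neighbor prediction against a small set of
    representative vectors instead of the full dataset."""
    # pairwise distances: shape (num_points, num_reps)
    d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
    return rep_labels[np.argmin(d, axis=1)]

# synthetic stand-in for the attached example: two well-separated
# classes of 10,000 ten-dimensional vectors each
rng = np.random.default_rng(1)
class_a = rng.normal(0.0, 0.5, size=(10000, 10))
class_b = rng.normal(3.0, 0.5, size=(10000, 10))
X = np.vstack([class_a, class_b])
y = np.repeat([0, 1], 10000)

# compressed representation: here, one centroid per class (an
# assumption -- the article does not say how its 4 vectors arise)
reps = np.vstack([class_a.mean(axis=0), class_b.mean(axis=0)])
rep_labels = np.array([0, 1])

preds = nn_predict(X, reps, rep_labels)
accuracy = (preds == y).mean()
```

Each prediction here costs a handful of distance computations regardless of dataset size, which is why throughput in the thousands of predictions per second is plausible even on a desktop machine.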
Octave Code:
Note that you’ll need to download my full archive to run this code, which you can find on ResearchGate.