This is exactly the same technique I posted the other day, which uses sorting to achieve an approximation of the nearest neighbor method. Clusters are generated by beginning at a given index in the sorted list and including rows to the left and right of that index, iterating through increasing boundary indexes, which simulates expanding the radius of a sphere (i.e., as you include more rows to the left and right of a given index, you are effectively including all rows within a sphere of increasing radius centered on the index row). The only substantive change in this case is the use of volume in calculating confidence (the other changes were corrections). Specifically, uncertainty increases as a function of volume, for the simple reason that a cluster of few rows contained within a large volume carries more uncertainty than a large number of rows contained in a small volume. In my original confidence-based classification algorithm, the spheres are all of fixed size, so you could argue volume didn’t matter in that case (especially since that algorithm works). I’ve also come up with yet another method, which makes use of a combinatorial measure of uncertainty that I mentioned in Footnote 14 of Information, Knowledge, and Uncertainty, and which should also work for variable-sized clusters; I should be done with it soon.
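To make the expansion step concrete, here’s a minimal Python sketch of the idea. The sort key isn’t specified above, so the Euclidean norm of each row is an assumption made purely for illustration; the actual implementation is in the code at the end of the post.

```python
import numpy as np

def sort_rows(X):
    # Sort rows by a one-dimensional key (here, assumed to be the
    # Euclidean norm), so that rows that are adjacent in the sorted
    # order are approximately adjacent in feature space.
    order = np.argsort(np.linalg.norm(X, axis=1))
    return X[order], order

def expand_cluster(n_rows, i, k):
    # Include k rows on either side of sorted index i, which simulates
    # growing a sphere of increasing radius around the row at index i.
    lo = max(i - k, 0)
    hi = min(i + k + 1, n_rows)
    return np.arange(lo, hi)

# Example: a cluster of up to 21 rows around sorted index 500.
X = np.random.rand(1000, 5)
X_sorted, order = sort_rows(X)
cluster = expand_cluster(len(X_sorted), i=500, k=10)
```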
This particular algorithm is incredibly fast, classifying 4,500 testing rows over 25,500 training rows in about two seconds, with accuracy that often peaks at close to perfect (i.e., accuracy increases as a function of confidence). The table below shows accuracies and runtimes for this method, averaged over 100 iterations. All of the testing percentages are fixed at 15% (i.e., 30,000 rows implies 4,500 testing rows and 25,500 training rows). The Raw Accuracy is the accuracy of the method on its own, with no confidence filtering, which is almost always terrible. The Max Accuracies are given for both the information-based confidence metric, which diminishes as a function of volume, and the size-based confidence metric, which does not, and simply looks to the number of elements in each cluster (i.e., the more cluster elements, the higher the confidence). Confidence filtering is achieved by simply ignoring all predictions that carry a confidence below a given minimum threshold (sketched in code after the table), which generally causes accuracy to increase as a function of that threshold. All of the max accuracies are excellent, except UCI Sonar, which is a dataset that’s given me problems in the past. I plan to address UCI Sonar separately with an analog to my Supervised Delta Algorithm, one that will also make use of sorting.
| Dataset | Raw Acc. | Max Acc. (Inf.) | Max Acc. (Size) | Runtime (Seconds) | No. Rows |
| --- | --- | --- | --- | --- | --- |
| UCI Credit | 38.07% | 78.30% | 100.0% | 3.290 | 30,000 |
| UCI Abalone | 17.37% | 83.50% | 71.30% | 0.322 | 4,177 |
| UCI Spam | 40.83% | 98.61% | 100.0% | 0.469 | 4,601 |
| UCI Ion | 53.00% | 94.30% | 100.0% | 0.057 | 351 |
| UCI Iris | 32.50% | 100.0% | 86.20% | 0.024 | 150 |
| UCI Parkinsons | 47.20% | 95.60% | 100.0% | 0.035 | 195 |
| UCI Wine | 37.50% | 94.20% | 100.0% | 0.029 | 178 |
| UCI Sonar | 20.16% | 58.11% | 34.80% | 0.048 | 208 |
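To make the confidence filtering concrete, here’s a minimal Python sketch. The exact information-based metric isn’t given above, so the volume-based function here is only a stand-in with the stated behavior: confidence falls as a cluster’s bounding volume grows and rises with the number of rows it contains. The full implementation is in the code at the end of the post.

```python
import numpy as np

def volume_based_confidence(cluster_rows):
    # Stand-in for the information-based metric: more rows packed into
    # a smaller bounding-box volume implies higher confidence.
    spans = cluster_rows.max(axis=0) - cluster_rows.min(axis=0)
    volume = np.prod(np.maximum(spans, 1e-12))  # avoid zero volume
    return len(cluster_rows) / volume

def size_based_confidence(cluster_rows):
    # The size-based metric simply counts the elements of the cluster.
    return len(cluster_rows)

def filtered_accuracy(preds, labels, conf, threshold):
    # Ignore all predictions with confidence below the threshold and
    # report accuracy on the predictions that remain.
    keep = conf >= threshold
    if not keep.any():
        return float("nan")  # nothing survives the filter
    return np.mean(preds[keep] == labels[keep])
```

Sweeping the threshold from low to high is what produces the Max Accuracies in the table: as the threshold rises, fewer predictions survive, but those that do are typically correct more often.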
There’s more to come, with simple variations on this overall method that reapply my core work using a sorted list. Once that’s complete, I’ll write something more formal about these methods.
Here’s the code; you can find the normalization algorithm a few posts below. The Command Line is set up as a function for convenience (MASSTempFunction), so just change the dataset path in that file, and you’re good to go.