Partitioning Datasets

A while back, I had an idea for an algorithm that I gave up on, simply because I had too much going on, but the gist is this: my algorithms can flag predictions that are probably wrong, so you pop all of those rows into a queue and let the rest of the predictions go through in what will be real time, even on a personal computer. The idea is to apply a separate model to these “rejected” rows, since they probably don’t fit the model generated by my algorithm. This would allow you to efficiently process the simplest corners of a dataset in polynomial time, and then apply more computationally intense methods to the remainder, using threading and all the usual capacity allocation techniques, which will still allow you to fly in close to real time; you just delay the difficult rows until they’re ready.

The intuition is that you stage prediction based upon whether the data is locally consistent or not, and this can vary row by row within a dataset. And this really is a bright-line, binary distinction (just read the paper in the last link), so you can rationally allocate processing capacity this way: if a prediction is “rejected”, you bounce it to a queue until the queue has some critical mass, and you then apply whatever method you’ve got that works for data that isn’t locally consistent, which is basically everyone else’s deep learning models.
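
To make the staging concrete, here is a minimal Python sketch of the dispatch loop under stated assumptions. Everything named here is hypothetical, not the author’s actual implementation: fast_model.predict_with_flag stands in for a cheap model that returns a prediction plus a flag marking predictions that are probably wrong, heavy_model.predict stands in for the expensive fallback, and REJECT_QUEUE_THRESHOLD stands in for the “critical mass” at which the queue gets drained.

```python
import queue
import numpy as np

# Hypothetical "critical mass": how many rejected rows accumulate
# before the expensive fallback model runs on them as a batch.
REJECT_QUEUE_THRESHOLD = 256


def staged_predict(rows, fast_model, heavy_model):
    """Send every row through the fast model; queue the rows it rejects."""
    rejected = queue.Queue()
    results = {}

    for i, row in enumerate(rows):
        # Assumed interface: the fast model returns a prediction plus a
        # boolean flag marking predictions that are probably wrong.
        pred, is_rejected = fast_model.predict_with_flag(row)
        if is_rejected:
            rejected.put((i, row))   # defer locally inconsistent rows
        else:
            results[i] = pred        # accept and emit immediately

        # Once enough rejected rows pile up, run the heavy model in batch.
        if rejected.qsize() >= REJECT_QUEUE_THRESHOLD:
            _drain(rejected, heavy_model, results)

    _drain(rejected, heavy_model, results)  # flush whatever remains
    return results


def _drain(rejected, heavy_model, results):
    """Apply the expensive fallback model to all queued rows at once."""
    batch = []
    while not rejected.empty():
        batch.append(rejected.get())
    if batch:
        indices, batch_rows = zip(*batch)
        preds = heavy_model.predict(np.array(batch_rows))
        results.update(zip(indices, preds))
```

In a real pipeline you’d drain the queue on a worker thread so the fast path never blocks; the batched drain here just keeps the sketch single-threaded and easy to read.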

