Note that if you’re using my algorithms on a dataset whose dimensions contain drastically different values, then performance could be negatively affected. One simple way to identify this problem is to take the base-10 logarithm of each dimension of a given row, and then measure the standard deviation of the resulting vector.
So, for example, if a row contains the entries [1 10 100], we would calculate log([1 10 100]) = [0 1 2], and then calculate the sample standard deviation of that vector, which in this case is 1. If this standard deviation is above 1, then you should probably experiment with weighting the data before training the algorithms. This can be done with a simple loop that divides the outsized dimensions by powers of 10 and tests the resulting accuracy, ultimately picking the weights that generate the greatest accuracy.
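To make this concrete, here is a minimal sketch of both steps in Python. The names `train_and_score` and `large_dims` are hypothetical placeholders, not part of my algorithms: `train_and_score` stands in for whatever routine trains a model on the reweighted data and returns its accuracy, and `large_dims` is the set of column indices flagged as outsized.

```python
import numpy as np

def log_scale_spread(row):
    # Sample standard deviation of the base-10 logs of a row's entries;
    # for [1, 10, 100] this is std([0, 1, 2]) = 1. A value above 1 suggests
    # the dimensions differ drastically in scale. Assumes positive entries.
    return np.std(np.log10(row), ddof=1)

def search_power_of_ten_weight(X_train, y_train, X_val, y_val,
                               large_dims, train_and_score, max_power=6):
    # Divide the outsized dimensions by 10**k for k = 0..max_power and keep
    # the divisor that yields the best accuracy on held-out data.
    best_k, best_acc = 0, -np.inf
    for k in range(max_power + 1):
        X_tr = np.array(X_train, dtype=float)
        X_va = np.array(X_val, dtype=float)
        X_tr[:, large_dims] /= 10.0 ** k
        X_va[:, large_dims] /= 10.0 ** k
        acc = train_and_score(X_tr, y_train, X_va, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

Searching a single shared divisor keeps the loop linear in the number of candidate powers; searching an independent power for every dimension would be exponential, which is why the sketch restricts the reweighting to the flagged dimensions.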
On a related note, I’m generalizing my state space optimization algorithm to allow for totally arbitrary input data, where the algorithm will decide on its own how much to weight a particular dimension of a dataset.
Note that this is not what gradient descent and other interpolation algorithms do. Those types of algorithms use the weights to classify or predict data. My AI algorithms can already simulate basically any machine learning or deep learning algorithm as they stand.
This additional step will instead allow my algorithms to identify which dimensions are relevant when given a dataset that contains a significant amount of irrelevant information (i.e., a significant number of “noise” dimensions), as opposed to unbalanced but relevant “signal” dimensions. That is, this process will allow my algorithms to autonomously identify the most relevant dimensions of a dataset, and then construct categories using those dimensions.
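As a hypothetical illustration of that goal (and emphatically not the state space algorithm itself), one crude way to score dimension relevance is the leave-one-out accuracy of a one-dimensional nearest-neighbor classifier: noise dimensions should score near chance regardless of their scale, while signal dimensions, balanced or not, should score above it.

```python
import numpy as np

def dimension_relevance(X, y):
    # Score each dimension by the leave-one-out accuracy of a 1-D
    # nearest-neighbor classifier that uses that dimension alone.
    # A generic stand-in to make the goal concrete, not the author's method.
    n, d = X.shape
    scores = np.empty(d)
    for j in range(d):
        col = X[:, j].astype(float)
        dist = np.abs(col[:, None] - col[None, :])  # pairwise 1-D distances
        np.fill_diagonal(dist, np.inf)              # exclude each point itself
        nearest = np.argmin(dist, axis=1)           # index of nearest neighbor
        scores[j] = np.mean(y[nearest] == y)        # neighbor predicts the label
    return scores
```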