Vectorized Correlation

Attached is some code that makes use of a measure of correlation I mentioned in my first real paper on A.I. (see the definition of “symm”) that I’ve finally gotten around to coding as a standalone measure.

The code is annotated to explain how it works, but the basic idea is that sorting reveals information about the correlation between two vectors of numbers. For example, imagine you have the numbers from 1 to 100, listed in ascending order, in vector $x$, and the numbers from -1 to -100, listed in descending order, in vector $y$. Plotted in the $(x,y)$ plane, these points fall on a straight line with negative slope.

Now sort each set of numbers in ascending order, and save the resultant mappings of ordinals. In the case of vector $x$, the list is already sorted in ascending order, so the ordinals don't change. In contrast, vector $y$ is listed in descending order, so ordinal 1 gets mapped to the last spot, ordinal 2 gets mapped to the second-to-last spot, and so on. This produces another pair of vectors that represent the mappings generated by the sorting function, which for vector $x$ will be $s_x = (1, 2, \ldots, N)$, and for vector $y$ will be $s_y = (N, N-1, \ldots, 1)$, where $N$ is the number of items in each vector.

Therefore, by taking the difference between the corresponding ordinals in $s_x$ and $s_y$, we arrive at a measure of correlation, since it tells us to what extent the values in $x$ and $y$ share the same ordinal relationships, which is more or less what correlation attempts to measure. This can be easily mapped to the traditional $[-1,1]$ scale, and the results are exactly what intuition suggests: the example above constitutes perfect negative correlation, an increasing line constitutes perfect positive correlation, and adding noise, or changing the shape, diminishes correlation.
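To make the idea concrete, here is a minimal Python sketch of the sorting approach described above. It is not the attached code. The function names `sort_ordinals` and `ordinal_correlation` are made up for illustration, and the normalization is an assumption on my part: it sums the absolute differences between the ordinals and divides by the largest value that sum can take, $\lfloor N^2/2 \rfloor$, which occurs when one ordering is the exact reverse of the other.

```python
def sort_ordinals(values):
    """Return the ordinal mapping produced by an ascending sort.

    Entry i is the 1-based position that values[i] occupies once the
    list is sorted. Ties are broken by original position, because
    Python's sort is stable (an assumption of this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ordinals = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ordinals[index] = position
    return ordinals


def ordinal_correlation(x, y):
    """Map the total ordinal displacement between x and y onto [-1, 1].

    Assumed normalization: the sum of |s_x[i] - s_y[i]| is divided by
    its maximum possible value, floor(N^2 / 2).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items")
    s_x = sort_ordinals(x)
    s_y = sort_ordinals(y)
    total = sum(abs(a - b) for a, b in zip(s_x, s_y))
    max_total = (n * n) // 2
    return 1.0 - 2.0 * total / max_total


x = list(range(1, 101))   # 1, 2, ..., 100
y = [-v for v in x]       # -1, -2, ..., -100

print(ordinal_correlation(x, y))  # -1.0 (perfect negative correlation)
print(ordinal_correlation(x, x))  # 1.0 (perfect positive correlation)
```

Because only the ordinals are used, any strictly increasing relationship scores 1 and any strictly decreasing relationship scores -1, regardless of the shape of the curve.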

Because I’ve abstracted sorting using information theory, you could, I suppose, measure the correlation between any two ordered sets of mathematical objects.

Also attached is another script that uses basically the same method to measure correlation between numerical data and ordinal data. The specific example attached allows you to measure which dimensions in a dataset (numerical) are most relevant to driving the value of the classifier (ordinal).

correlation-cmndline-1 Download

ordinal-corr.-cmndline Download
