In a previous article, I showed that you can use the nearest neighbor algorithm to build a simple graph based upon which vectors in a dataset are nearest neighbors of one another. In simple terms, if two vectors are nearest neighbors, then you connect them with an edge. Once you do this for all vectors in the dataset, you will have a graph. It dawned on me earlier today that you can define a measure of dataset consistency by analyzing the properties of the resultant graph. Specifically, take the set of maximal paths in that graph, which we'll call $P$, and label the vertices with their classifiers. Now, for each such path in $P$, simply take the entropy of the distribution of classifier labels. This is exactly the technique I introduced in another previous article, though I suppose I didn't make it plain that it can apply to any dataset. You can also perhaps use this to measure whether any errors are due to truly similar classes by first converting the labels to measure-based classifiers, using the method I introduced in the article before this one. That is, if the errors appear only where two classes are truly similar, then maybe that's just the way it is and the best you can do; or, you can somehow tighten your lens on the dataset, which is something you can actually do in my model of machine learning.
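To make this concrete, here's a minimal sketch of the measure in Python, under two assumptions not fixed above: distances are Euclidean, and the connected components of the nearest neighbor graph stand in for its maximal paths (a path lies inside a component, so zero entropy on every component implies zero entropy on every path, and a mixed-label component must contain a mixed-label path). The function name nn_graph_entropy and the toy data are illustrative.

```python
# Sketch of the consistency measure: connect each vector to its nearest
# neighbor, then compute the label entropy of each connected component
# (used here as a stand-in for the maximal paths of the graph).
import numpy as np
from collections import Counter

def nn_graph_entropy(X, labels):
    n = len(X)
    # Pairwise squared Euclidean distances; mask the diagonal so a
    # vector is never its own nearest neighbor.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)

    # Union-find over the undirected edges (i, nn[i]).
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        parent[find(i)] = find(nn[i])

    # Entropy of the classifier-label distribution within each component.
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(labels[i])
    entropies = {}
    for root, labs in comps.items():
        p = np.array(list(Counter(labs).values()), dtype=float)
        p /= p.sum()
        entropies[root] = float(-(p * np.log2(p)).sum())
    return entropies

# Two well-separated clusters with consistent labels give zero entropy.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
labels = ["A"] * 5 + ["B"] * 5
print(nn_graph_entropy(X, labels))  # all entropies 0.0 => locally consistent
```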
To be perfectly clear, if any given path in $P$ contains more than one classifier, then that path by definition implies that the nearest neighbor algorithm will generate an error, and therefore the dataset is what I would call not locally consistent. Here's a brief proof:
Assume that $x$ and $y$ are both vertices in some path $p$ within the set $P$. Because they are both part of the same path, there must be a subset of $p$ that connects $x$ to $y$, which we'll call $q$. Assume that the classifier of $x$ is $A$ and that the classifier of $y$ is $B$, for $A \neq B$. It follows that as we traverse $q$, beginning at $x$, the classifiers of the vertices along that path must at some point change from $A$ to $B$; let $(u, v)$ be the edge where this change occurs. Note that it could be the case that $(u, v) = (x, y)$. Because the edge $(u, v)$ is in the graph, one of its endpoints is the nearest neighbor of the other; say $v$ is the nearest neighbor of $u$ (the argument is symmetric otherwise). Since by definition the classifiers of $u$ and $v$ are unequal, it is the case that the nearest neighbor algorithm generated an error when given the input $u$.
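As a sanity check on the proof, here's a toy example in the same spirit: a one-dimensional chain of vectors whose classifier changes from A to B at a single edge. The dataset and names are illustrative; running the nearest neighbor algorithm over the chain produces an error exactly at that change, as the argument predicts.

```python
# A toy check of the argument above: points on a chain whose classifier
# changes from "A" to "B" at one edge.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # a chain of vectors
labels = ["A", "A", "A", "B", "B"]       # the change occurs at edge (2, 3)

# Nearest neighbor of each vector (self excluded).
d = np.abs(X[:, None] - X[None, :])
np.fill_diagonal(d, np.inf)
nn = d.argmin(axis=1)

# 1-NN predicts the label of the nearest neighbor; the vertex incident to
# the change edge is misclassified, as the proof requires.
for i, j in enumerate(nn):
    status = "ok" if labels[j] == labels[i] else "ERROR"
    print(f"input x={X[i]}: predicted {labels[j]}, actual {labels[i]} ({status})")
```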