Vectorized Genetic Classification

November 6, 2022 (updated November 7, 2022) / erdosfan

Some of my recent work suggests that genetic sequences may be locally consistent, in that small changes to bases do not change macroscopic classifiers (e.g., species, common ancestry, and possibly even traits). I’m still testing this empirically, but I’ve already developed an algorithm that produces perfect accuracy on an NIH dataset I put together from Influenza A and Rotavirus samples. The classification task in that case was really simple, just distinguishing between the two species, and I only used 15 rows of each species, but the accuracy was perfect.

I also tested it on three datasets from Kaggle that contain different genetic lines of dogs, chimpanzees, and humans. The classification task in that case is to predict the subclass / family line of each sample within its species, given its genetic base data. This is a tougher classification problem because you’re not distinguishing between species, and instead you’re identifying common family lines within each species. For all three Kaggle datasets, accuracy, expressed as a function of confidence, peaked at 100%. For the NIH dataset it was perfect without any confidence filtering.

The method I’ve employed in all cases is a simple modification of the Nearest Neighbor algorithm, where the nearest neighbor of a sequence x is the sequence y in the dataset that has the largest number of bases in common with x. The implementation is highly vectorized, resulting in excellent runtimes, in this case fully processing a dataset with 817 rows and 18,908 columns in about 10 to 12 seconds. A measure of confidence is then applied to filter predictions.
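To make the idea concrete, here is a minimal Python/NumPy sketch of a match-count nearest neighbor. It is an illustration under stated assumptions, not the implementation used for the results above: it assumes the sequences are aligned, equal-length rows with bases coded as integers, it counts matches position by position, and it scores each row against all the other rows in the same dataset. The post does not define the confidence measure, so the one used here (the fraction of positions shared with the nearest neighbor) is purely a stand-in, and the names `match_counts`, `predict_leave_one_out`, and `sweep` are hypothetical.

```python
import numpy as np

def match_counts(X):
    """counts[i, j] = number of positions where rows i and j hold the same symbol."""
    n = X.shape[0]
    counts = np.zeros((n, n), dtype=np.int64)
    for symbol in np.unique(X):
        mask = (X == symbol).astype(np.int64)  # (n, m) 0/1 indicator for this symbol
        counts += mask @ mask.T                # shared occurrences, all pairs at once
    return counts

def predict_leave_one_out(X, labels):
    """Give each row the label of the other row it shares the most bases with."""
    counts = match_counts(X)
    np.fill_diagonal(counts, -1)               # a row may not be its own neighbor
    nearest = counts.argmax(axis=1)            # ties go to the lowest row index
    # ASSUMPTION: confidence = fraction of positions shared with the nearest
    # neighbor. The post does not say how its confidence measure is defined.
    confidence = counts[np.arange(len(X)), nearest] / X.shape[1]
    return labels[nearest], confidence

def sweep(predicted, labels, confidence, thresholds):
    """For each threshold, return (surviving predictions, accuracy among them)."""
    rows = []
    for t in thresholds:
        keep = confidence >= t
        if keep.any():
            accuracy = float((predicted[keep] == labels[keep]).mean())
        else:
            accuracy = float("nan")            # nothing survived this threshold
        rows.append((int(keep.sum()), accuracy))
    return rows

# Toy data (made up): bases coded A=0, C=1, G=2, T=3, with two classes.
X = np.array([[0, 1, 2, 3, 0, 1],
              [0, 1, 2, 3, 0, 2],
              [3, 3, 3, 3, 3, 3],
              [3, 3, 3, 3, 3, 0]])
labels = np.array(["a", "a", "b", "b"])

predicted, confidence = predict_leave_one_out(X, labels)
curve = sweep(predicted, labels, confidence, thresholds=[0.5, 0.9])
```

The only loop in `match_counts` runs over the distinct symbols (four for plain DNA), so the pairwise comparison itself is a handful of matrix products instead of a loop over pairs of rows. The `sweep` helper produces the two quantities discussed above, the number of predictions that survive a confidence threshold and the accuracy among the survivors.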

I will continue to test this algorithm and develop related work (e.g., a clustering algorithm follows trivially from this, since you can pull all sequences that have a match count of at least k for a given input sequence). Here’s a screenshot for the Kaggle dog dataset, with the number of surviving predictions as a function of confidence on the left, and the accuracy as a function of confidence on the right:
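On the clustering idea mentioned above, a rough Python/NumPy sketch might look like the following. It assumes aligned, equal-length sequences stored as rows of an integer-coded array and counts matches position by position. The function names `match_counts_to` and `cluster_for` are hypothetical, and the toy data is made up.

```python
import numpy as np

def match_counts_to(query, X):
    """Number of positions at which each row of X agrees with the query."""
    # Broadcasting compares the query against every row in one step.
    return (X == query).sum(axis=1)

def cluster_for(query, X, k):
    """Indices of all rows of X that share at least k bases with the query."""
    return np.flatnonzero(match_counts_to(query, X) >= k)

# Toy data (made up): bases coded A=0, C=1, G=2, T=3.
X = np.array([[0, 1, 2, 3, 0, 1],
              [0, 1, 2, 3, 0, 2],
              [3, 3, 3, 3, 3, 3],
              [3, 3, 3, 3, 3, 0]])

members = cluster_for(X[0], X, k=5)
```

With this toy data, rows 0 and 1 share five of six positions, so a threshold of `k=5` pulls exactly those two rows for the first sequence. Lowering `k` widens the cluster, so the threshold plays the same role as a radius would in a distance-based method.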