Using Black Tree AutoML to Classify Genetics

I’m currently working on genetics, and I’ve already come up with some novel algorithms that I think are interesting, though they’re not yet empirically tested. Out of curiosity, however, I simply assigned the numbers 10, 11, 12, 13 to the bases A, T, G, C, respectively, and ran Black Tree’s Supervised Delta Algorithm on a Kaggle dataset of dog DNA. The accuracy was 82.6%, using only a minuscule fraction of the base pairs in the dataset, which is quite large: specifically, I used 30 bases per row, in a dataset where some rows contain 18,908 bases. Yet again, this software outperforms everything else by enormous margins. The runtime was 0.31 seconds on a cheap laptop. You can do a limited version of exactly this using the totally Free Version of Black Tree AutoML, which lets you run Supervised Delta on the training dataset only (i.e., there’s no separate training / testing run in the free version). My next step will be to optimize the values assigned to the bases, which are obviously arbitrary, and I have an optimization algorithm that can do exactly that. Here’s a screenshot of the output using Black Tree Massive:

[Screenshot: Black Tree Massive output]
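For concreteness, here is a minimal sketch in Python of the encoding step described above. It assumes a CSV layout with a sequence column and a breed label column, and that the 30 bases are taken from the start of each sequence; those details, along with the file names, are my assumptions for illustration, not a description of what Black Tree does internally.

```python
# Minimal sketch of the base encoding described above (NOT Black Tree's
# internal code). Assumed CSV layout: a "sequence" column holding the DNA
# string and a "breed" column holding the class label.
import numpy as np
import pandas as pd

BASE_VALUES = {"A": 10.0, "T": 11.0, "G": 12.0, "C": 13.0}  # arbitrary values, to be optimized later
N_BASES = 30  # number of bases actually used from each row

def encode(df: pd.DataFrame) -> tuple[np.ndarray, np.ndarray]:
    """Map the first N_BASES bases of each sequence to a numeric vector."""
    X = np.array([[BASE_VALUES[b] for b in seq[:N_BASES]] for seq in df["sequence"]])
    y = df["breed"].to_numpy()
    return X, y

# File names are assumed; substitute whatever the downloaded files are called.
X_train, y_train = encode(pd.read_csv("dog_training.csv"))
X_test, y_test = encode(pd.read_csv("dog_testing.csv"))
```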

Here are the datasets (dog_training, dog_testing): simply download them and run Black Tree on them, though again, the Free Version will limit you to just the training file. I’m still hunting down the provenance of the dataset, since it comes from Kaggle rather than a primary source, but if it is a bona fide dataset, it would be evidence for the claim that DNA is locally consistent. If that’s true, then my existing software (and other simple methods such as Nearest Neighbor, sketched below) will, as a general matter, perform well, which means there’s really no reason to do much more work on the topic. This would free researchers to focus on identifying genes, classifying populations, and developing therapies.
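To make the “simple methods” point concrete, here is a plain 1-nearest-neighbor baseline over the 30-base vectors from the sketch above, using scikit-learn. This is not the Supervised Delta Algorithm, just the kind of off-the-shelf comparison I have in mind, so whatever accuracy it produces shouldn’t be read as Black Tree’s result.

```python
# Plain 1-nearest-neighbor baseline on the encoded vectors (X_train, y_train,
# X_test, y_test) produced by the encoding sketch above. This is NOT the
# Supervised Delta Algorithm, just the simple comparison method mentioned here.
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
accuracy = accuracy_score(y_test, knn.predict(X_test))
print(f"1-NN accuracy on {N_BASES} bases: {accuracy:.3f}")
```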

