Genetic Clustering Algorithm

Attached is a clustering algorithm specific to genetics. It pulls all genomes in a dataset that match a given input sequence in at least K bases. This algorithm is mentioned in this paper. There’s an additional segment of code, based upon my unsupervised clustering algorithm, that generates a value of K for you.
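For concreteness, here is a minimal Octave sketch of the matching step described above (the attached code is the authoritative implementation; the function name and the encoding of the bases A, C, G, T as the numbers 1 through 4 are my own assumptions):

% Pull all rows of an M x N dataset that match an input sequence in at least
% K positions.
function cluster_rows = match_cluster(dataset, input_seq, K)
  match_counts = sum(dataset == input_seq, 2);  % matching bases per row
  cluster_rows = find(match_counts >= K);       % rows with at least K matches
end

% Example usage with random data:
% dataset   = randi(4, 100, 50);
% input_seq = randi(4, 1, 50);
% rows_out  = match_cluster(dataset, input_seq, 40);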

Also attached is an algorithm that analyzes the inter-connectivity of a cluster. Specifically, it’s hard to find circuits and other structures in graphs, so as a workaround, this algorithm simply takes the running union of the rows in a sequence of clusters, and tracks the rate of change in the number of elements. So for example, if the first cluster contains 400 elements, and the next cluster contains 410 elements, only 10 of which are novel, then the cardinality of the union increases by only 10, and is therefore relatively unchanged. By tracking this rate of change we get a measure of the interconnectivity of a sequence of clusters.
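Here is a minimal sketch of that measure in Octave (my own function name; the clusters are assumed to be supplied as a cell array of row-index vectors):

% Takes a sequence of clusters (a cell array of row-index vectors), forms the
% running union, and returns the number of novel elements contributed at each
% step. Small values indicate that successive clusters contain mostly the same
% rows, i.e., that the sequence of clusters is highly interconnected.
function deltas = union_growth(clusters)
  running_union = [];
  deltas = zeros(1, numel(clusters));
  for i = 1:numel(clusters)
    new_union = union(running_union, clusters{i});
    deltas(i) = numel(new_union) - numel(running_union);  % count of novel elements
    running_union = new_union;
  end
end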

Some More Methods in Computational Genomics

There are two more methods in genomics I wanted to share: the first is a simple measure of signal versus noise rooted in my work on epistemology, and the second is a clustering algorithm I came up with a long time ago but never implemented, rooted in the Nearest Neighbor method.

Assume your population is stored as an M \times N matrix, with M sequences, and N bases per sequence. Now assume we measure the density of each of the four bases in column i. If it turns out that the densities are given by (0.9, 0.05, 0.02, 0.03) for A,C,G,T, respectively, it’s reasonable to conclude that the modal base of A with a density of 0.9 is signal, and not noise, in the sense that basically all of the sequences in the dataset contain an A at that index. Moreover, the other three bases are roughly equally distributed, in context, again suggesting that the modal base of A is signal, and not noise.

When you have a continuous signal, noise is easier to think about, because if it’s both additive and subtractive, it should cancel itself out through repeated observation. A population drawn from a single species is analogous, because you are in effect repeatedly sampling some “true” underlying genetic sequence, subject to mutations / noise. Unfortunately, you cannot meaningfully take the average of a discrete signal sequence like a genome. However, this doesn’t change the fact that a uniform distribution is indicative of noise. So if for example, the densities were instead (0.24, 0.26, 0.25, 0.25), then we could reasonably dismiss this index as noise, since the bases are uniformly distributed at that index across the entire population. In contrast, if we’re given the densities (0.9, 0.1, 0.0, 0.0), then we should be confident that the A is signal, but we should not be as confident that the C is noise, since the other two bases both have a density of 0.

We can formalize this by applying the measures of Information, Knowledge, and Uncertainty I presented in the paper of the same name. Information is in this case the applicable maximum entropy, which is log(4) = 2 for all four bases, log(3) for three bases, etc. Uncertainty is the entropy of the distribution, and Knowledge is the balance of Information over Uncertainty, K = I - U. Interestingly, this suggests an equivalence between Knowledge and signal, which is not surprising, as I’ve defined Knowledge as, “information that reduces uncertainty.”
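As a concrete sketch, the per-column calculation looks like the following in Octave (assuming the bases A, C, G, T are encoded as the numbers 1 through 4; the function name is my own):

% For each column of an M x N population matrix, Information is the applicable
% maximum entropy log2(b), where b is the number of bases actually present in
% that column; Uncertainty is the entropy of the observed distribution; and
% Knowledge is K = I - U. High K marks the modal base as signal.
function [I, U, K] = column_knowledge(population)
  [M, N] = size(population);
  I = zeros(1, N);
  U = zeros(1, N);
  for j = 1:N
    counts = histc(population(:, j), 1:4);   % counts of A, C, G, T in column j
    p = counts(counts > 0) / M;              % densities of the bases present
    I(j) = log2(numel(p));                   % applicable maximum entropy
    U(j) = -sum(p .* log2(p));               % entropy of the observed distribution
  end
  K = I - U;
end

For the example above, the densities (0.9, 0.1, 0.0, 0.0) give I = log2(2) = 1 and U of roughly 0.47, so K is roughly 0.53, consistent with treating the A as signal.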

I also wrote an interesting clustering algorithm that keeps applying Nearest Neighbor in a chain, until it hits a loop. So for example, first we find the Nearest Neighbor of row i, let’s say it’s row j. Then we find the Nearest Neighbor of row j, and keep doing this until we create a loop, by returning the same row more than once. I haven’t tested it too much, but it seems meaningful because the clusters are generally very small, suggesting that it’s not flying around the dataset, and is instead pulling rows that are proximate to each other.
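Here is a minimal sketch of that chain in Octave (my own names; I use mismatch count, i.e., Hamming distance, as the measure of proximity, which is an assumption on my part):

% Starting from row i, repeatedly jump to the nearest neighbor of the current
% row, and stop as soon as a row is visited twice. The returned chain is the
% resulting cluster of rows.
function chain = nn_chain(dataset, i)
  chain = i;
  current = i;
  while true
    mismatches = sum(dataset ~= dataset(current, :), 2);  % distance to every row
    mismatches(current) = Inf;                            % exclude the current row itself
    [~, next] = min(mismatches);
    if any(chain == next)                                 % loop: a row was returned twice
      break;
    end
    chain(end + 1) = next;
    current = next;
  end
end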

 

Partial Matches in Genomes

I’m continuing to work on the applications of Machine Learning to Genomics, and my work so far suggests that genetic sequences are locally consistent, in that the Nearest Neighbor method generally produces good accuracy. More importantly, the Nearest Neighbor of a given sequence seems to be determined by only a fraction of the total sequence. Specifically, if we sample a random subset of the bases, and increase the size of that subset, sequences quickly converge to their true Nearest Neighbor. That is, even if we use only a fraction of the bases, the Nearest Neighbor method returns the true Nearest Neighbor that would be found if we used all of the bases. This is surprising, and is consistent with the hypothesis that bases are not truly random. Specifically, it is consistent with the hypothesis that a small number of bases determines some larger number of bases. The attached code randomly selects an increasing number of bases, and applies the Nearest Neighbor method. Because the base indexes are randomly selected, this suggests that every subset of a sequence partially determines some larger sequence.
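Here is a minimal sketch of that experiment in Octave (my own names; mismatch count is used as the distance, which is an assumption on my part, and the dataset is assumed to be an M x N numeric matrix of encoded bases):

% For each subset size, randomly select that many base indexes, run Nearest
% Neighbor on the subset, and record the fraction of rows whose subset-based
% nearest neighbor equals the true nearest neighbor found using all N bases.
function convergence = subset_nn_convergence(dataset, subset_sizes)
  [M, N] = size(dataset);
  true_nn = zeros(M, 1);
  for i = 1:M
    d = sum(dataset ~= dataset(i, :), 2);
    d(i) = Inf;
    [~, true_nn(i)] = min(d);
  end
  convergence = zeros(1, numel(subset_sizes));
  for s = 1:numel(subset_sizes)
    idx = randperm(N, subset_sizes(s));        % random subset of base indexes
    hits = 0;
    for i = 1:M
      d = sum(dataset(:, idx) ~= dataset(i, idx), 2);
      d(i) = Inf;
      [~, nn] = min(d);
      hits = hits + (nn == true_nn(i));
    end
    convergence(s) = hits / M;                 % fraction converged to the true Nearest Neighbor
  end
end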

If this is true, then genetic sequences are not truly random. If they are not truly random, how do we explain the diversity of life, even given the amount of time in play? Note that this would not preclude randomized mutations to individual bases, but it does suggest that a sizable number of randomized mutations will cause an even larger sequence to mutate. Below is a plot of the percentage of sequences that converge to their true Nearest Neighbor, given x\% of the bases in the full sequence (i.e., the horizontal axis plots the percentage of bases fed to the Nearest Neighbor algorithm). The plot on the left uses the genome of the Influenza A virus, and the plot on the right uses the genome of the Rota virus. Both are assembled using data from the National Institutes of Health.

This brings genetic sequences closer to true computer programs, because in this view, not all sequences “compile”. As a consequence, if two sequences match on some sufficiently large but incomplete number of bases, then they should match on an even larger subset of their bases, since the matching subset fixes some larger subset of the bases. This would allow partial genetic sequences to be compared to datasets of complete genetic sequences, potentially filling in missing bases. If this is true, it also casts doubt on the claim that the diversity of life is the product of random and gradual changes, since the number of possible base-pairs is simply astronomical. If most of the possible sequences don’t code for life, as a result of possibility being restricted by selection, you’ll never get to life in any sensible amount of time, because all the sequences that contain prohibited combinations will be duds, and will not persist, thereby cutting short that path through the space of possible sequences. In other words, this would require luck, to simply land at the correct mutations, rather than gradually arriving at the correct mutations in sequence.

We can resolve this by allowing for both small-scale mutations, and large-scale mutations, which could lead to drastic changes in measurable traits. In other words, this doesn’t preclude evolution, it simply requires evolution to be extremely subtle (i.e., some small number of bases mutate), or drastic (i.e., some large number of bases all change together). In summary, if this is true, then mutations are either small in number, or sufficiently large in number, causing an even larger sequence to change, generating either duds, or traits that are permitted. Natural Selection would then operate on the traits that are permitted, allowing the environment to determine what traits persist in the long run.

A Note on Signal Versus Noise

I spend a lot of time thinking about signal and noise, and my intuition is that a signal is something that has a low Kolmogorov Complexity, and that noise causes it to have a higher Kolmogorov Complexity. That sounds nice, and I think it makes sense, but the question is, what can I do with this definition? I think the answer so far is nothing, and it’s purely academic. However, it just dawned on me that if noise is statistically random, then it should begin to cancel itself out when it’s sampled independently at a large number of positions. Just imagine pure noise, and sample at different places at the same time – it should either net to zero (additive and subtractive noise), or increase in amplitude (strictly additive noise), as your number of samples grows sufficiently large. This in turn implies that we should be able to discern between true noise, and a signal subject to noise, by increasing the number of independent samples of the ostensible signal. Specifically, if it nets to nothing, or increases in amplitude, without changing structure, then it’s pure noise. If it instead changes structure, then there is an underlying signal, subject to additive and subtractive noise, that is starting to net itself out. In contrast, I don’t think there is any obvious solution to strictly additive noise. Nonetheless, this is a pretty good practical definition of signal versus noise, and it works in practice: I just wrote some simple code in Octave where you increase the number of independent noisy representations of a signal, and the average plainly reduces noise, as a function of the number of independent representations.
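As a sketch of what that Octave experiment looks like (my own variable names, with a sine wave standing in for the underlying signal):

% Observe a sine wave k times with independent additive and subtractive
% Gaussian noise, and plot the average of the k observations; the noise
% visibly nets out as k grows.
t = linspace(0, 2*pi, 500);
signal = sin(t);
for k = [1 10 100]
  observations = repmat(signal, k, 1) + 0.5 * randn(k, numel(t));
  averaged = mean(observations, 1);
  figure;
  plot(t, averaged);
  title(sprintf('Average of %d noisy observations', k));
end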

Vectorized Genetic Classification

Some of my recent work suggests the possibility that genetic sequences are locally consistent, in that small changes to bases do not change macroscopic classifiers (e.g., species, common ancestry, and possibly even traits). I’m still testing this empirically, but I’ve already developed an algorithm that produces perfect accuracy on an NIH dataset I put together from Influenza A and Rota Virus samples. The classification task was in that case very simple, distinguishing between the two species, but the accuracy was perfect, using only 15 rows of each species.

I also tested it on three datasets from Kaggle, which contain different genetic lines of dogs, chimpanzees, and humans. The classification task in that case is to predict the subclass / family line of each species, given its genetic base data. This is a tougher classification because you’re not distinguishing between species, and are instead identifying common family lines within each species. Accuracy, which is expressed as a function of confidence, peaked at perfect for all three; for the NIH dataset it was perfect without confidence filtering. The method I’ve employed in all cases is a simple modification of the Nearest Neighbor algorithm, where sequences x and y are treated as nearest neighbors if x and y have the largest number of bases in common over the dataset of sequences. The implementation is highly vectorized, resulting in excellent runtimes, in this case fully processing a dataset with 817 rows and 18,908 columns in about 10 to 12 seconds. A measure of confidence is then applied to filter predictions.
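Here is a minimal, non-vectorized sketch of the underlying idea in Octave (my own names; the labels are assumed to be numeric, and the fraction of matching bases is used as a crude stand-in for the confidence measure, which is my own assumption):

% For each testing row, count the bases it has in common with every training
% row, predict the label of the training row with the largest match count, and
% report the match fraction as a confidence score that can be used to filter
% predictions.
function [predictions, confidence] = match_nn_classify(train_data, train_labels, test_data)
  num_test = rows(test_data);
  N = columns(train_data);
  predictions = zeros(num_test, 1);
  confidence = zeros(num_test, 1);
  for i = 1:num_test
    matches = sum(train_data == test_data(i, :), 2);  % bases in common with each training row
    [best, idx] = max(matches);
    predictions(i) = train_labels(idx);
    confidence(i) = best / N;                         % fraction of bases in common
  end
end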

I will continue to test this algorithm, and develop related work (e.g., a clustering algorithm follows trivially from this since you can pull all sequences that have a match count of at least k for a given input sequence). Here’s a screen shot for the Kaggle dog dataset, with the number of surviving predictions as a function of confidence on the left, and the accuracy as a function of confidence on the right:

Attached is the code, subject to my stated Copyright Policy (as always).

Using Black Tree AutoML to Classify Genetics

I’m currently working on genetics, and have already come up with some novel algorithms that I think are interesting, but are not yet empirically tested. I did however, out of curiosity, simply assign the numbers 10, 11, 12, 13 to the bases A, T, G, C, respectively, and ran Black Tree’s Supervised Delta Algorithm on a dataset from Kaggle that contains dog DNA, and the accuracy was 82.6%, using a minuscule percentage of the base pairs in the dataset, which is quite large. Specifically, I used 30 bases from the dataset, which for some rows contained 18,908 bases. Yet again, this software outperforms everything else by simply gigantic margins. The runtime was 0.31 seconds, running on a cheap laptop. You can do a limited version of exactly this using the totally Free Version of Black Tree AutoML, which will let you run Supervised Delta on just the training dataset (i.e., there’s no training / testing for free). My next step will be to optimize the values assigned to the bases, which are obviously not meaningful, and I have an optimization algorithm that can do exactly that. Here’s a screen shot of the output using Black Tree Massive:

 

Here are the datasets (dog_training, dog_testing); simply download and run Black Tree on these, though again, the Free Version will limit you to just the training file. I’m still hunting down the provenance of the dataset, because it comes from Kaggle and not a primary source, but if this is a bona fide dataset, it would be evidence for the claim that DNA is locally consistent. If that is true, then my existing software (and other simple methods such as Nearest Neighbor) will as a general matter perform well, which means there’s really no reason to do much more work on the topic. This would allow researchers to focus on identifying genes, classifying populations, and developing therapies.

Genetic Delimiting

I’ve written a simple delimiter algorithm that allows you to delimit genetic sequences. The basic idea is that if a base is sufficiently dense at an index in a population, then it could be part of a gene, and so a run of sufficiently dense bases, in order, is useful to identify, since that sequence of bases could form a gene.

The basic idea for this algorithm is really simple: calculate the densities of the bases in a population, at each index in a sequence, then calculate the standard deviation of those densities.

Now read the densities in the order of the underlying genetic sequence, and if you see the density change by more than the standard deviation, mark that index with a delimiter. So if e.g., the standard deviation is 20%, and there’s a jump from 90% density to 50% density, then you put a delimiter between the 90% and 50% dense bases, marking the end of a gene.

All of the operations in the code can be run in parallel, and so it has a constant runtime, on a parallel machine. It is extremely fast even running on an ordinary laptop.

Here’s a more detailed example:

The bases are assumed to be pulled from a single strand of DNA, since the pairs are determined by one strand. The example in the code below uses a population of 100 people, each with 50 bases, creating a matrix with 50 rows (i.e., the number of bases per individual) and 100 columns (i.e., the number of individuals in the population). The next step is to find the modal base for each row, which is the most frequent base at a given index in a sequence. Let’s assume for example there are 80 A’s and 20 G’s in a given row. The modal base is A, and the density is 80%. We do this for every row, which will create a column vector of densities, with one entry / density for each row of the matrix. Now you read that vector in order, from index 1 to N, and if the density changes from one row to the next by more than the standard deviation of the densities, you mark a delimiter between those two indexes.

So let’s assume the densities between two rows transition from 80% to 50%, and the standard deviation is 20%. We would mark a delimiter between those two rows, indicating the end of a gene, because we went from a base that many had in common, to a base that few had in common. The tacit assumption is that genes will be common to a population, indicating signal, and everything else will be noise. By delimiting in this manner, we mark the end of a gene and the commencement of noise.

Here’s the code:

https://www.dropbox.com/s/zo0x1u46qbt2fci/Genetic%20Delimter%20CMNDLINE.pdf?dl=0
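And here is a minimal sketch of the same procedure, for readers who just want the idea (my own names; the linked code is the authoritative implementation). The population is assumed to be an N x M matrix with N bases per individual and M individuals, with the bases encoded as 1 through 4:

% Compute the density of the modal base at each index, then mark a delimiter
% wherever the density changes between adjacent indexes by more than the
% standard deviation of the densities. A delimiter at position i sits between
% index i and index i + 1.
function delimiters = delimit_genes(population)
  [N, M] = size(population);
  densities = zeros(N, 1);
  for i = 1:N
    counts = histc(population(i, :), 1:4);   % counts of A, C, G, T at index i
    densities(i) = max(counts) / M;          % density of the modal base
  end
  sigma = std(densities);
  jumps = abs(diff(densities));              % change in density between adjacent indexes
  delimiters = find(jumps > sigma);
end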

 

Clustering Genetic Sequences Using A.I.

Following up on a previous post, where I introduced an algorithm to find the longest genetic sequence common to a population, I’ve now applied my most basic clustering algorithm to genetic sequences, in order to classify populations based upon genetic data. The results are excellent, at least so far, and I’d welcome additional datasets to test this on (my software is linked to below).

The basic idea is to use A.I. clustering to uncover populations in genetic data.

To test this idea, I created two populations, each based upon a unique genetic sequence that is 50 base pairs long, and each sequence was randomly generated, creating two totally random genetic sequences that are the “seeds” for the two populations. Each population has 100 individuals / sequences, and again, each sequence is 50 base pairs long. This ultimately creates a dataset that has 200 rows (i.e., individuals) and 50 columns (i.e., base pairs).

I then added noise to both populations, randomizing some percentage of the base pairs (e.g., swapping an A with a T). Population 1 has 2.5% of its base pairs effectively mutated using this process, and Population 2 has 5.0% of its base pairs mutated. This creates two populations that are internally heterogeneous, due to mutation / noise, and of course distinct from one another (i.e., the original sequences for Populations 1 and 2 were randomly generated and are therefore distinct).
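As a sketch, the model dataset described above can be generated along the following lines in Octave (my own names; note that a randomized base can coincide with the original base, so the effective mutation rate is slightly lower than the nominal rate):

% Build two populations of 100 individuals each from two random 50-base seed
% sequences, mutate 2.5% of the bases in Population 1 and 5% in Population 2,
% and stack them into a single 200 x 50 dataset with hidden labels.
function [dataset, labels] = build_populations()
  num_bases = 50;
  pop_size = 100;
  seed1 = randi(4, 1, num_bases);
  seed2 = randi(4, 1, num_bases);
  pop1 = mutate(repmat(seed1, pop_size, 1), 0.025);
  pop2 = mutate(repmat(seed2, pop_size, 1), 0.05);
  dataset = [pop1; pop2];
  labels = [ones(pop_size, 1); 2 * ones(pop_size, 1)];
end

function population = mutate(population, rate)
  mask = rand(size(population)) < rate;      % select the bases to randomize
  population(mask) = randi(4, nnz(mask), 1); % replace them with random bases
end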

The A.I. task is to take the resultant dataset, generated by combining the two populations into a single dataset, and have the algorithm produce what are called “clusters”, which in this case means that the algorithm looks at the genetic sequence for an individual, and pulls all other sequences that it believes to be related to that individual, without “cheating” and looking at the hidden label that ultimately allows us to calculate the accuracy of the algorithm.

The algorithm in question is interesting, because it is unsupervised, meaning that there is no training step, and so the algorithm starts completely blind, with no information at all, and has to nonetheless produce clusters. This would allow for genetic data to be analyzed without any human labels at all.

This is in my opinion quite a big deal, because you wouldn’t need someone to say ex ante, that this population is e.g., from Europe, and this one is from Asia. You would instead simply let the algorithm run, and determine on its own, what individuals belong together based upon the genetic information provided, with no human guidance at all.

In this case, the accuracy is perfect, despite the absence of a training step, which is actually typical of my software. There’s a set of formal proofs I presented that guarantee perfect accuracy under reasonable conditions, which are of course violated in some cases, though this (i.e., genetic datasets) doesn’t seem to be one of them. However, I’ll note that the proofs I presented (and linked to) relate to the supervised version of this algorithm, whereas I’ve never been able to formally prove why this algorithm works, though empirically it most certainly works across a wide variety of datasets. It is however very similar to the supervised version, and you can read about the supervised version here.

Finally, I’ll note that because there are only four possible bases, the sequences are relatively short (i.e., only 50 base pairs long), and they’re randomly generated with random noise, it’s possible to produce disastrous accuracy using my model datasets. That is, it’s possible that the initially generated population sequences are sufficiently similar, and that noise then makes this problem even worse, producing a dataset that doesn’t really have two populations. For this reason, I welcome real-world datasets, as I’m new to genetics, though not new to A.I.

Finding Common Genetic Sequences in Linear Time

Begin with a sequence of DNA base pairs for each individual in a population of size M. Then separate these sequences into M single strands of DNA. Assume that each DNA strand contains N individual bases over S = (A, C, G, T). Now construct a matrix X where each column of X contains characters that define the DNA sequence of exactly one individual from the population. It follows that X is a matrix with N rows (i.e., the number of bases) and M columns (i.e., the number of individuals in the population). You can think of the matrix X as readable from the first row of a given column, down to the bottom row of that same column, which would define the genetic sequence for a given individual in the population.

Now construct a new matrix Y such that Y = F(X), where F maps S to \bar{S} = (1,2,3,4). That is, F maps each of the four possible bases S = (A, C, G, T) to the numbers 1 through 4, respectively. All we’ve done is encode the bases from the original matrix X using numbers. Now for each row of Y, calculate the density of each of the four bases in the row, and store those four resultant densities in a matrix Z, that has N rows and 4 columns. That is, each of the four possible bases will have some density in each row of Y, which is then stored in a corresponding row of Z, with each of the four columns of Z containing the densities of (A, C, G, T), respectively, for a given row of Y.

Further, construct a vector V = max(Z) with a dimension of (N \times 1), where row i of V contains the maximum entry of row i of Z. That is, for every row of Z, which contains the densities of each of the four bases for that row, we find the base that is most dense, and store that density in the corresponding row of a new vector V. Then, construct a binary vector \bar{V} that maps every element of V to either 1 or 0, depending upon whether or not the entry in question is greater than some threshold in [0,1]. The threshold allows us to say that if e.g., the density of A in a given row exceeds 80\%, then we treat it as homogenous, and uniformly A, disregarding the balance of the entries that are not A’s. It follows that the longest sequence of consecutive 1’s in \bar{V} is the longest sequence of bases common to the entire population in question (subject to the threshold). All of these operations have in the worst case a linear runtime, or less when run in parallel. As a consequence, we can identify DNA sequences common to an entire population in worst case linear time.

The first step of mapping the bases to numbers can be done independently for each entry, and so this step has a constant runtime in parallel. The second step of calculating the densities can be accomplished in worst case linear time for each row, since a sum over M operands can be done with at most M operations. The densities for each row can be calculated in parallel, and so this step requires a linear number of steps as a function of M. The next step of taking the maximum densities can again be done in parallel for each row, and requires 4 operations for a given row. The next step of comparing each maximum density for each row to a fixed threshold can again be done in parallel, and therefore requires constant time. The final step of finding the longest sequence requires exactly N operations. As a result, the total runtime of this algorithm, when run in parallel, is O(max(N,M)).

Attached is Octave code that implements this, and finds the longest genetic sequence common to a population (using the threshold concept). This can be modified trivially to find the next longest, and so on. This will likely allow for e.g., long sequences common to people with genetic diseases to be found, thereby allowing for at least the possibility of therapies.
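For concreteness, here is a minimal sketch of the procedure in Octave (the attached code is the authoritative implementation; the names below are my own, and X is assumed to be an N x M character matrix of bases):

% Encode the bases as numbers, compute the density of each base in each row,
% take the modal density per row, threshold it, and return the longest run of
% consecutive rows that exceed the threshold.
function [start_index, run_length] = longest_common_sequence(X, threshold)
  [N, M] = size(X);
  Y = zeros(N, M);
  Y(X == 'A') = 1; Y(X == 'C') = 2; Y(X == 'G') = 3; Y(X == 'T') = 4;  % Y = F(X)
  Z = zeros(N, 4);
  for b = 1:4
    Z(:, b) = sum(Y == b, 2) / M;            % density of each base in each row
  end
  V = max(Z, [], 2);                         % density of the modal base, per row
  V_bar = V > threshold;                     % binary vector: 1 where the row clears the threshold
  run_length = 0;
  start_index = 0;
  current = 0;
  for i = 1:N
    if V_bar(i)
      current = current + 1;
      if current > run_length
        run_length = current;
        start_index = i - current + 1;       % first row of the longest run so far
      end
    else
      current = 0;
    end
  end
end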