Genetic Delimiting

I’ve written a simple algorithm that delimits genetic sequences. The basic idea is that if a base is sufficiently dense at a given index in a population, it could be part of a gene. A run of consecutive, sufficiently dense bases is therefore worth identifying, since that run could form a gene.

The algorithm itself is simple: at each index in a sequence, calculate the density of the most common base across the population, then calculate the standard deviation of those densities.
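As a rough illustration of this first step, here is a minimal Python sketch (not the linked code). The function name `modal_densities`, the toy population, and the choice of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation are all assumptions made for the example.

```python
from collections import Counter
from statistics import pstdev

def modal_densities(sequences):
    """For each index, return the fraction of sequences that carry
    the most common base at that index."""
    length = len(sequences[0])
    densities = []
    for i in range(length):
        column = [seq[i] for seq in sequences]
        # Count of the most frequent base at this index.
        modal_count = Counter(column).most_common(1)[0][1]
        densities.append(modal_count / len(sequences))
    return densities

# Made-up toy population: 5 individuals, 4 bases each.
population = ["ACGT", "ACGA", "ACTC", "ACAG", "AGCT"]

densities = modal_densities(population)
spread = pstdev(densities)  # assumption: population standard deviation

print(densities)  # [1.0, 0.8, 0.4, 0.4]
print(spread)     # roughly 0.26
```

The first two indexes are shared by most of the toy population, while the last two are not, which is the kind of contrast the delimiter is meant to pick up.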

Now read the densities in the order of the underlying genetic sequence, and wherever the density changes by more than the standard deviation from one index to the next, mark a delimiter between the two. For example, if the standard deviation is 20% and the density jumps from 90% to 50%, you put a delimiter between the 90% and 50% dense bases, marking the end of a gene.
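The delimiting rule can be sketched in a few lines of Python. This is only an illustration: the function name `delimiter_positions` is hypothetical, the threshold is passed in as 0.2 to mirror the 20% figure above rather than computed from the data, and treating a "change" as an absolute difference in either direction is my reading of the rule.

```python
def delimiter_positions(densities, threshold):
    """Return every index i where a delimiter falls between
    position i and position i + 1 (0-based)."""
    return [
        i
        for i in range(len(densities) - 1)
        # Assumption: a rise or a fall larger than the threshold both count.
        if abs(densities[i + 1] - densities[i]) > threshold
    ]

# Made-up densities: two dense bases followed by three sparse ones.
example_densities = [0.9, 0.9, 0.5, 0.5, 0.45]
cuts = delimiter_positions(example_densities, threshold=0.2)

print(cuts)  # [1]: a delimiter between the second and third bases
```

The only change larger than 0.2 is the drop from 0.9 to 0.5, so a single delimiter is placed there.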

Every operation in the code can be run in parallel, so on a parallel machine it has a constant runtime. It is also extremely fast on an ordinary laptop.

Here’s a more detailed example:

The bases are assumed to be pulled from a single strand of DNA, since the pairs are determined by one strand. The example in the code below uses a population of 100 people, each with 50 bases, creating a matrix with 50 rows (i.e., the number of bases per individual) and 100 columns (i.e., the number of individuals in the population).

The next step is to find the modal base for each row, which is the most frequent base at a given index in a sequence. Let’s assume, for example, that there are 80 A’s and 20 G’s in a given row. The modal base is A, and the density is 80%. We do this for every row, which creates a column vector of densities, with one density for each row of the matrix.

Now you read that vector in order, from index 1 to N, and if the density changes from one row to the next by more than the standard deviation of the densities, you mark a delimiter between those two indexes.
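To make the matrix layout concrete, here is a hedged end-to-end sketch in Python on synthetic data; it is not the linked code. The planted "gene" in rows 10 to 29, the 90-out-of-100 sharing rate, the fixed seed, and the use of the population standard deviation are all assumptions chosen for illustration. Indexes are 0-based here, unlike the 1 to N convention above.

```python
import random
from collections import Counter
from statistics import pstdev

random.seed(0)  # fixed seed so the run is repeatable
BASES = "ACGT"
N_BASES, N_PEOPLE = 50, 100
CONSERVED = range(10, 30)  # hypothetical conserved region (rows 10..29)

# matrix[i][j] is the base at index i for individual j:
# 50 rows (bases per individual) by 100 columns (individuals).
matrix = []
for i in range(N_BASES):
    if i in CONSERVED:
        # 90 individuals share one base; the other 10 are random.
        shared = random.choice(BASES)
        row = [shared] * 90 + random.choices(BASES, k=10)
    else:
        # Noise: every individual gets a uniformly random base.
        row = random.choices(BASES, k=N_PEOPLE)
    matrix.append(row)

# One density per row: the share of the population with the modal base.
densities = [Counter(row).most_common(1)[0][1] / N_PEOPLE for row in matrix]

# Assumption: population standard deviation of the densities.
threshold = pstdev(densities)

# cuts contains i when a delimiter falls between rows i and i + 1.
cuts = [
    i
    for i in range(N_BASES - 1)
    if abs(densities[i + 1] - densities[i]) > threshold
]
print(cuts)
```

With data built this way, the delimiters should land at the two edges of the planted region, between rows 9 and 10 and between rows 29 and 30, since those are the only places where the density moves by more than the standard deviation.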

Suppose the densities of two adjacent rows go from 80% to 50%, and the standard deviation is 20%. The change of 30% exceeds the standard deviation, so we mark a delimiter between those two rows, indicating the end of a gene, because we went from a base that many had in common to a base that few had in common. The tacit assumption is that genes will be common to a population, indicating signal, and everything else will be noise. By delimiting in this manner, we indicate the end of a gene and the commencement of noise.

Here’s the code:

https://www.dropbox.com/s/zo0x1u46qbt2fci/Genetic%20Delimter%20CMNDLINE.pdf?dl=0

 

