Finding Common Genetic Sequences in Linear Time

November 1, 2022 | November 2, 2022 / erdosfan

Begin with a sequence of DNA base pairs for each individual in a population of size $M$. Then separate these sequences into $M$ single strands of DNA. Assume that each DNA strand contains $N$ individual bases over $S = (A, C, G, T)$. Now construct a matrix $X$ where each column of $X$ contains the characters that define the DNA sequence of exactly one individual from the population. It follows that $X$ is a matrix with $N$ rows (i.e., the number of bases) and $M$ columns (i.e., the number of individuals in the population). You can read the matrix $X$ from the first row of a given column down to the bottom row of that same column, which gives the genetic sequence of a given individual in the population.
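As a minimal sketch of this layout, the Python below builds $X$ from a list of equal-length strands. The toy strands and the helper name `build_matrix` are assumptions made for illustration and are not taken from the attached Octave code.

```python
# Hypothetical toy population: M = 3 individuals, N = 4 bases each.
strands = ["ACGT", "ACGA", "TCGT"]


def build_matrix(strands):
    """Return X as a list of N rows, each holding M characters (one per individual)."""
    n = len(strands[0])
    if any(len(s) != n for s in strands):
        raise ValueError("all strands must have the same length N")
    # Row i collects base i of every individual, so column j spells out strand j.
    return [[s[i] for s in strands] for i in range(n)]


X = build_matrix(strands)

# Reading column 0 from top to bottom recovers the first individual's strand.
column_0 = "".join(row[0] for row in X)
```

Storing $X$ row by row keeps each base position in one place, which is convenient because every later step works on one row at a time.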

Now construct a new matrix $Y$ such that $Y = F(X)$, where $F$ maps $S$ to $\bar{S} = (1,2,3,4)$. That is, $F$ maps each of the four possible bases $S = (A, C, G, T)$ to the numbers 1 through 4, respectively. All we've done is encode the bases from the original matrix $X$ using numbers. Now for each row of $Y$, calculate the density of each of the four bases in the row (i.e., the fraction of the $M$ entries in that row equal to that base), and store those four densities in a matrix $Z$ that has $N$ rows and $4$ columns. That is, row $i$ of $Z$ holds the densities of $(A, C, G, T)$, in that order, for row $i$ of $Y$.
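The encoding and density steps can be sketched in a few lines of Python. This assumes that "density" means the fraction of the $M$ entries in a row equal to a given base; the names `ENCODING`, `encode`, and `densities` are illustrative and are not taken from the attached Octave code.

```python
# F maps the bases (A, C, G, T) to (1, 2, 3, 4), respectively.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(X):
    """Apply F entrywise to X, giving the numeric matrix Y."""
    return [[ENCODING[base] for base in row] for row in X]


def densities(Y):
    """Return Z: for each row of Y, the fraction of entries equal to 1, 2, 3, 4."""
    Z = []
    for row in Y:
        m = len(row)
        Z.append([row.count(code) / m for code in (1, 2, 3, 4)])
    return Z


# Toy X with N = 4 rows (bases) and M = 3 columns (individuals).
X = [list("AAT"), list("CCC"), list("GGG"), list("TAT")]
Y = encode(X)
Z = densities(Y)
```

Each row of `Z` sums to 1, and a row such as `[0.0, 1.0, 0.0, 0.0]` indicates a position where every individual has the same base.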

Further, construct a vector $V = \max(Z)$ with a dimension of $(N \times 1)$, where row $i$ of $V$ contains the maximum entry of row $i$ of $Z$. That is, for every row of $Z$, which contains the densities of the four bases at that position, we find the base that is most dense and store its density in the corresponding row of $V$. Then, construct a binary vector $\bar{V}$ that maps every element of $V$ to either $1$ or $0$, depending upon whether or not the entry in question is greater than some threshold in $[0,1]$. The threshold allows us to say that if, e.g., the density of A in a given row exceeds $80\%$, then we treat the row as homogeneous and uniformly A, disregarding the remaining entries that are not A's. It follows that the longest run of consecutive $1$'s in $\bar{V}$ is the longest sequence of bases common to the entire population in question (subject to the threshold). All of these operations have, in the worst case, a linear runtime, or less when run in parallel. As a consequence, we can identify DNA sequences common to an entire population in worst-case linear time.
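The maximum, threshold, and longest-run steps can be sketched as follows. This is a sequential illustration rather than the attached Octave code: the function names, the default threshold of $0.8$, and the zero-based indexing are assumptions, and ties for the densest base are broken in favor of the first base in $(A, C, G, T)$.

```python
def longest_common_run(Z, threshold=0.8):
    """Return (V_bar, start, length) for the longest run of 1's in V_bar."""
    V = [max(row) for row in Z]                      # densest base per row
    V_bar = [1 if v > threshold else 0 for v in V]   # strict comparison, as in the text
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, bit in enumerate(V_bar):                  # single pass over the N rows
        if bit:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return V_bar, best_start, best_len


def consensus(Z, start, length):
    """Spell out the densest base of each row in the run (illustrative helper)."""
    return "".join("ACGT"[row.index(max(row))] for row in Z[start:start + length])


# Hand-made density matrix with N = 5 rows; columns are (A, C, G, T).
Z = [
    [0.5, 0.5, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 1.0],
]
V_bar, start, length = longest_common_run(Z)
common = consensus(Z, start, length)
```

Here `V_bar` is `[0, 1, 1, 0, 1]`, so the longest run starts at row index 1, has length 2, and spells out `"AC"`.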

The first step of mapping the bases to numbers can be done independently for each entry, and so this step has a constant runtime in parallel. The second step of calculating the densities can be accomplished in worst-case linear time for each row, since a sum over $M$ operands can be done with at most $M$ operations. The densities for each row can be calculated in parallel, and so this step requires a linear number of steps as a function of $M$. The next step of taking the maximum densities can again be done in parallel for each row, and requires $4$ operations for a given row. The next step of comparing the maximum density for each row to a fixed threshold can again be done in parallel, and therefore requires constant time. The final step of finding the longest sequence requires exactly $N$ operations. As a result, the total runtime of this algorithm, when run in parallel, is $O(\max(N,M))$.
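Summing the step costs above makes the bound explicit. The symbol $T_{\text{parallel}}$ is introduced here only for illustration:

$$
T_{\text{parallel}} = \underbrace{O(1)}_{\text{encode}} + \underbrace{O(M)}_{\text{densities}} + \underbrace{O(1)}_{\text{row maxima}} + \underbrace{O(1)}_{\text{threshold}} + \underbrace{O(N)}_{\text{longest run}} = O(M + N) = O(\max(N, M)).
$$

The last equality holds because $\max(N, M) \le M + N \le 2\max(N, M)$.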

Attached is Octave code that implements this, and finds the longest genetic sequence common to a population (subject to the threshold described above). This can be modified trivially to find the next longest, and so on. This will likely allow, e.g., long sequences common to people with genetic diseases to be found, thereby allowing for at least the possibility of therapies.

density_based_sequencing_cmdnline-1-1 Download
