I noted yesterday that not only is local alignment intractable (if you treat every base index as a potential insertion or deletion), it also doesn’t seem to matter much compared to global alignment. I tested this experimentally on the mtDNA dataset I’ve put together, and it turns out that in the best case, you add an average of 41.42 matching bases by accounting for local alignment. Specifically, I ran Nearest Neighbor on every genome, and then counted all sequences of unequal bases between a given genome and its nearest neighbor that were at least 10 bases long. The minimum length is necessary, because otherwise you end up shifting bases locally in sequences that are short, and that could therefore be the result of chance.

In some cases local alignment does add a significant number of bases, and you can see that in the chart below, which shows the maximum percentage increase that could be achieved using local alignment over sequences at least 10 bases long. However, you can also see that this is extremely rare, and that the increase is typically around 0. The x-axis shows the genome index in the dataset, and the y-axis shows the ratio of (a) the extra matching bases due to local alignment to (b) the original matching base count, which peaks at around 15%. That is, it shows the percentage increase in matching bases due to accounting for local alignment. Moreover, this is the absolute maximum, since it assumes that a single shift in a sequence of M unequal bases will produce M-1 matching bases, which is obviously not guaranteed.

The plain takeaway is that unless you’re looking for genes, or looking for insertions and deletions, local alignment is probably not important. In particular, my entire model is predicated upon allowing a machine to examine an entire genome, rather than look at individual genes or regions.
Ironically, this algorithm is probably useful for identifying potential insertions and deletions, for the simple reason that it identifies significantly long sequences of unequal bases.
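
To make the upper bound concrete, here is a minimal Python sketch of the calculation described above. It is not the linked MATLAB code, and the function names (mismatch_runs, max_match_gain, nearest_neighbor) are hypothetical. It assumes genomes are plain strings compared position by position, that the nearest neighbor is the genome with the most matching bases, and that each qualifying run of M unequal bases contributes at most M-1 extra matches.

```python
def match_count(a, b):
    """Number of positions where two sequences have the same base."""
    return sum(1 for x, y in zip(a, b) if x == y)


def nearest_neighbor(genomes, i):
    """Index of the genome (other than i) with the most matching bases (assumed metric)."""
    others = [j for j in range(len(genomes)) if j != i]
    return max(others, key=lambda j: match_count(genomes[i], genomes[j]))


def mismatch_runs(a, b, min_len=10):
    """Return (start, length) for each maximal run of unequal bases of length >= min_len."""
    runs = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new run of unequal bases begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Close a run that extends to the end of the shorter sequence.
    if start is not None and n - start >= min_len:
        runs.append((start, n - start))
    return runs


def max_match_gain(a, b, min_len=10):
    """Return (original matches, best-case extra matches, ratio of extra to original)."""
    base = match_count(a, b)
    # Best case: one shift turns a run of M unequal bases into M - 1 matches.
    extra = sum(length - 1 for _, length in mismatch_runs(a, b, min_len))
    ratio = extra / base if base else 0.0
    return base, extra, ratio


# Toy example: one run of 12 unequal bases (counted) and one run of 3 (ignored).
a = "A" * 5 + "C" * 12 + "A" * 5 + "C" * 3 + "A" * 2
b = "A" * 5 + "G" * 12 + "A" * 5 + "G" * 3 + "A" * 2
base, extra, ratio = max_match_gain(a, b)
print(base, extra)  # 12 11
```

Because the runs it finds are exactly the long stretches of unequal bases, the output of mismatch_runs is also a list of candidate insertion and deletion sites.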

Here’s the dataset:
https://www.dropbox.com/s/ht5g2rqg090himo/mtDNA.zip?dl=0
All genomes are taken from the NIH, and each has a provenance file that links to the NIH Database.
Here’s the code:
https://www.dropbox.com/s/sz032x9py1vk3qf/Calc_Max_Match.m?dl=0
https://www.dropbox.com/s/npnz8ljzoh3sxzy/Max_Match_Count_CMNDLINE.m?dl=0
