Efficient substring matching in perl

Asked 6/6, 2011 at 22:53 Answered 7/6, 2011 at 6:24

Solved perl string string-matching substring bioperl

I am looking for an efficient solution to do find the longest possible substring in a string tolerating n mismatches in the main string

Eg: Main String

AGACGTACTACTCTACTAGATGCA*TACTCTAC*
AGACGTACTACTCTACTAGATGCA*TACTCTAC*
AGACGTACTACTCTACAAGATGCA*TACTCTAC*
AGACGTACTACTTTACAAGATGCA*TACTCTAC*

Search String:

TACTCTACT : this should be considered a match to all of the above main strings.

Also I there could be the case where part of the substring is at the end of main string and I would like to pick that up also.

I would appreciate if you could give some pointers.

PS: I will have one search string and about 100 million main strings to search the substring for.

Thanks! -Abhi

Civic answered 6/6, 2011 at 22:53 Comment(7)

Questions. 1) Your numbers in "Main String" - are those separate lines in one "Main String", or four separate "Main String"s (I assume the latter, but want to confirm)? 2) Can you explain why that search string would be considered a match for #3 and #4? The first two clearly match, but it's not clear to me why the last two do, and that sounds like it may be a crucial detail to understand. 3) Can you explain what you mean that part of the substring may be at the end of the main string? – Zook 6/6, 2011 at 23:27

You mention "longest", but your example doesn't provide any clue as to what you mean by longest. Do you mean with the fewest mismatch? – Temekatemerity 6/6, 2011 at 23:48

@Brian Gerard: 1) Those are 4 of the million main strings. 2) Because he tolerates mismatches. #3: A for T, #4, T for C, A for T. 3) "...TACT" should end-of-line-match search string TACTCTACT, because it might be followed by "CTACT". – Temekatemerity 7/6, 2011 at 0:12

@Brian : 1.) I showed 4 different types of main strings for example purpose. 2.) I am allowing for n mismatches. In this case n = 1,2 etc. 3.) the search string could be partially contained in the main string and we want to match that partial case also. Thanks for your questions – Civic 7/6, 2011 at 1:58

@Temekatemerity : By longest I mean the literally the "longest" and best possible match ( least mismatches). In way this could be misleading and thanks for asking. So longest might not be the right word. What I meant was if the substring is partially contained at the end of main string I would like to pick that case and not ingore it. I hope this makes it a bit more clearer. – Civic 7/6, 2011 at 2:0

TACTCTAC* is a better match than TACTTTACA in your fourth example – Temekatemerity 7/6, 2011 at 6:19

@Temekatemerity : thanks for your help so far. I have another follow up question. Previously I had one query string compared to millions of search string one at a time for a substring occurence. Now if I want to increase my search space to include multiple query strings, I can either run the same logic again but can I use other data structures which can build something like a suffix tree for query strings and use that for searching substring in the search string ?? Thanks! – Civic 14/6, 2011 at 23:52

use strict;
use warnings;
use feature qw( say );

sub match {
   my ($s, $t, $max_x) = @_;

   my $m = my @s = unpack('(a)*', $s);
   my $n = my @t = unpack('(a)*', $t);

   my @length_at_k     = ( 0 ) x ($m+$n);
   my @mismatches_at_k = ( 0 ) x ($m+$n);
   my $offset = $m;

   my $best_length = 0;
   my @solutions;
   for my $i (0..$m-1) {
      --$offset;

      for my $j (0..$n-1) {
         my $k = $j + $offset;

         if ($s[$i] eq $t[$j]) {
            ++$length_at_k[$k];
         }
         elsif ($length_at_k[$k] > 0 && $mismatches_at_k[$k] < $max_x) {
            ++$length_at_k[$k];
            ++$mismatches_at_k[$k];
         }
         else {
            $length_at_k[$k] = 0;
            $mismatches_at_k[$k] = 0;
         }

         my $length = $length_at_k[$k] + $max_x - $mismatches_at_k[$k];
         $length = $i+1 if $length > $i+1;

         if ($length >= $best_length) {
            if ($length > $best_length) {
               $best_length = $length;
               @solutions = ();
            }

            push @solutions, $i-$length+1;
         }
      }
   }

   return map { substr($s, $_, $best_length) } @solutions;
}

say for match('AABBCC', 'DDBBEE', 2);

Output:

AABB
ABBC
BBCC

Temekatemerity answered 7/6, 2011 at 6:24 Comment(3)

you code works absolutely fine. Just wanted to get some idea if the number of sequences that I am checking in a certain amount time sounds feasible or is there a way to scale it up. I am searching for a 36 alphabets substring (with upto 5 mismatches) in 90 million 100 alphabets long search sequences. It is taking me about 13 hours to do this search. Does this sounds like normal based on your experience. I haven't time/memory profiled my code. Thanks! – Civic 9/6, 2011 at 18:32

@Abhi, I'm no expert in algorithms, but I took every shortcut I could think of. You could make it an order or two faster by implementing it in C. – Temekatemerity 9/6, 2011 at 19:19

@Mariya, It's based on a "dynamic programming" solution to the Longest Common Substring problem. See an explanation of what it's based on here. – Temekatemerity 9/12, 2011 at 21:41

I'm not certain that this is what you're after but BioPerl has an approximate-grep tool called Bio::Grep::Backend::Agrep:

Bio::Grep::Backend::Agrep searches for a query with agrep

And agrep is "approximate grep". AFAIK, this builds a database and then uses that database to make your searches faster so this sounds like what you're after.

Looks like you're working with DNA sequences so you probably have BioPerl already installed.

You could also try using String::Approx directly:

Perl extension for approximate matching (fuzzy matching)

I suspect that Bio::Grep::Backend::Agrep will be faster and a better match for your task though.

Hinson answered 6/6, 2011 at 23:27 Comment(1)

I will try this and let you guys know it that works and is efficient. – Civic 7/6, 2011 at 2:1

use strict;
use warnings;
use feature qw( say );

sub match {
   my ($s, $t, $max_x) = @_;

   my $m = my @s = unpack('(a)*', $s);
   my $n = my @t = unpack('(a)*', $t);

   my @length_at_k     = ( 0 ) x ($m+$n);
   my @mismatches_at_k = ( 0 ) x ($m+$n);
   my $offset = $m;

   my $best_length = 0;
   my @solutions;
   for my $i (0..$m-1) {
      --$offset;

      for my $j (0..$n-1) {
         my $k = $j + $offset;

         if ($s[$i] eq $t[$j]) {
            ++$length_at_k[$k];
         }
         elsif ($length_at_k[$k] > 0 && $mismatches_at_k[$k] < $max_x) {
            ++$length_at_k[$k];
            ++$mismatches_at_k[$k];
         }
         else {
            $length_at_k[$k] = 0;
            $mismatches_at_k[$k] = 0;
         }

         my $length = $length_at_k[$k] + $max_x - $mismatches_at_k[$k];
         $length = $i+1 if $length > $i+1;

         if ($length >= $best_length) {
            if ($length > $best_length) {
               $best_length = $length;
               @solutions = ();
            }

            push @solutions, $i-$length+1;
         }
      }
   }

   return map { substr($s, $_, $best_length) } @solutions;
}

say for match('AABBCC', 'DDBBEE', 2);

Output:

AABB
ABBC
BBCC

Temekatemerity answered 7/6, 2011 at 6:24 Comment(3)

@Abhi, I'm no expert in algorithms, but I took every shortcut I could think of. You could make it an order or two faster by implementing it in C. – Temekatemerity 9/6, 2011 at 19:19

@Mariya, It's based on a "dynamic programming" solution to the Longest Common Substring problem. See an explanation of what it's based on here. – Temekatemerity 9/12, 2011 at 21:41

Recommended topics

Hot tags