perlstringstring-matchingsubstringbioperl

Efficient substring matching in perl


I am looking for an efficient solution to do find the longest possible substring in a string tolerating n mismatches in the main string

Eg: Main String

  1. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  2. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  3. AGACGTACTACTCTACAAGATGCA*TACTCTAC*
  4. AGACGTACTACTTTACAAGATGCA*TACTCTAC*

Search String:

  1. TACTCTACT : this should be considered a match to all of the above main strings.

Also I there could be the case where part of the substring is at the end of main string and I would like to pick that up also.

I would appreciate if you could give some pointers.

PS: I will have one search string and about 100 million main strings to search the substring for.

Thanks! -Abhi


Solution

  • use strict;
    use warnings;
    use feature qw( say );
    
    sub match {
       my ($s, $t, $max_x) = @_;
    
       my $m = my @s = unpack('(a)*', $s);
       my $n = my @t = unpack('(a)*', $t);
    
       my @length_at_k     = ( 0 ) x ($m+$n);
       my @mismatches_at_k = ( 0 ) x ($m+$n);
       my $offset = $m;
    
       my $best_length = 0;
       my @solutions;
       for my $i (0..$m-1) {
          --$offset;
    
          for my $j (0..$n-1) {
             my $k = $j + $offset;
    
             if ($s[$i] eq $t[$j]) {
                ++$length_at_k[$k];
             }
             elsif ($length_at_k[$k] > 0 && $mismatches_at_k[$k] < $max_x) {
                ++$length_at_k[$k];
                ++$mismatches_at_k[$k];
             }
             else {
                $length_at_k[$k] = 0;
                $mismatches_at_k[$k] = 0;
             }
    
             my $length = $length_at_k[$k] + $max_x - $mismatches_at_k[$k];
             $length = $i+1 if $length > $i+1;
    
             if ($length >= $best_length) {
                if ($length > $best_length) {
                   $best_length = $length;
                   @solutions = ();
                }
    
                push @solutions, $i-$length+1;
             }
          }
       }
    
       return map { substr($s, $_, $best_length) } @solutions;
    }
    
    say for match('AABBCC', 'DDBBEE', 2);
    

    Output:

    AABB
    ABBC
    BBCC