cperlinline-c

Compare 2 arrays of string for matches in C optimization


I am having a perl script that has 2 arrays, 1 with keys and 1 with substring. I need to check if substring of 1 array have matches in the keys array. The amount of records is huge, something that can be counted in millions so I use Inline:C to speed up the search, however it is still taking hours to treat the records.

--Perl part

//%h contains {"AAAAA1" => 1, "BBBBBB" => 1, "BB1234" =>1, "C12345" => 1.... }
my @k=sort keys %h;
//@k contains ["AAAAA1", "BBBBBB", "BB1234", "C12345".... ]
my @nn;
//@n contains [ "AAAAA1999", "AAAAABBB134", "D123edae", "C12345CCSAER"]
// "AAAAA1" (from @k) can be found in "AAAAA1999" (in @n) = OK
foreach(@n) {
        my $res=array_search(\@k,$_);
        if($res) {
                $y++;
        } else {
                $z++;
                push @nn,$_;
        }
}

--C part

int fastcmp ( char *p1, char *p2 ) {
  while( *p1 ){
    char *a = p1, *b = p2;    
    if (*b != *a) return 0;
    ++p1; ++b;
  }
  return 1;
}

int array_search(AV *a1, SV *s1){
        STRLEN bytes1;
        char *p1,*p2,*n;
        long a1_size,i,c;
        a1_size = av_len(a1);
        p1 = SvPV(s1,bytes1);        
        for(i=start;i<=a1_size;++i){
            SV** elem = av_fetch(a1, i, 0);
            SV** elem_next = (i<a1_size-1)?av_fetch(a1, i+1, 0):elem;
            p2 = SvPV_nolen (*elem);
            n = SvPV_nolen (*elem_next);
            if (p1[0] == p2[0]) {
                if (fastcmp(p1,p2)>0) {
                    return i; 
                }
            }
            if ((p1[0] == p2[0]) && (p2[0] != n[0])) { return -1; }
        }
        return -1;
}

If somebody could help to optimize the search, that could be nice. Thanks.

Note: added comments to help what is inside each variables.


Solution

  • The implementation you have fails in many ways:

    It also incorrectly assumes the strings are null-ternminated, which can lead to a SEGFAULT when they aren't.


    Your approach scans @k for each element of @n, but if you build a trie (as the following code does), it can be scanned once.

    my $alt = join '|', map quotemeta, keys %h;
    my $re = qr/^(?:$alt)/;
    
    my @nn = sort grep !/$re/, @n;
    my $z = @nn;
    my $y = @n - @nn;
    

    For example, if there are 1,000 Ns and 1,000 Hs, your solution does up to 1,000,000 comparisons and mine does 1,000.

    Note that 5.10+ is needed for the regex optimisation of alternations into a trie. Regexp::List can be used on older versions.

    A proper C implementation will be a little faster because you can do a trie search using a function that does just that rather than using the regex engine.