phpcmetaphone

PHP metaphone implementation bug


I'm testing a metaphone implementation for C# and comparing its results against the built-in metaphone() function from PHP. However, I've come across a bug (which is previously documented in PHP's issue tracker and discussed on a mailing list), but I'm trying to understand the C code behind their bug for my own personal interest.

Basically, according to the metaphone algorithm, most instances of -gh- should be rendered silent. In the specific test case of "wright", I expect (and generate with my own algorithm) a metaphone key of "RT"

"wr" => R
"i"  => ignored
"gh" => ignored
"t"  => T

Result: RT

However, PHP's metaphone function returns RFT. Clearly, it's converting the -gh- to an F, as if it were at the end of a word (e.g. "rough"), but in the case of the word "wright", this is incorrect, because the -gh- does not come at the end of the word. Looking at the metaphone.c file in the PHP source distribution, I see a few key things:

/* These prevent GH from becoming F */
#define NOGHTOF(c)  (ENCODE(c) & 16)    /* BDH */

...

/* Go N letters back. */
#define Look_Back_Letter(n) (w_idx >= n ? toupper(word[w_idx-n]) : '\0')

And then on line 342:

case 'G':
    if (Next_Letter == 'H') {
        if (!(NOGHTOF(Look_Back_Letter(3)) || Look_Back_Letter(4) == 'H')) {
            Phonize('F');
            skip_letter++;

Can someone help me understand what exactly the NOGHTOF function does and why this code is incorrectly rendering an F for the -gh- in "wright"? I'm not really a C guy, so the code isn't at all clear to me.


Solution

  • The meaning of NOGHTOF(c) is actually determined by the code starting at line 81:

    char _codes[26] = {
            1, 16, 4, 16, 9, 2, 4, 16, 9, 2, 0, 2, 2, 2, 1, 4, 0, 2, 4, 4, 1, 0, 0, 0, 8, 0
        /*  a  b   c  d   e  f  g  h   i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z */
    };
    
    #define ENCODE(c) (isalpha(c) ? _codes[((toupper(c)) - 'A')] : 0)
    

    Essentially, a value is assigned for each letter of the alphabet in order (A = 1, B = 16, etc.) Then ENCODE macro checks whether the passed character is a letter; if yes, it returns the corresponding code for that letter, otherwise it returns the null character. (It doesn't really return anything, as this is a macro and is substituted by the compiler at compile time to replace the actual call.)

    The way I'm reading the code for 'G' is this (without trying to understand why):

    If current letter is G then
        If next letter is H then
            Take "_code" value of a letter three letters back (why?) from the _codes table and check the fifth bit (from the back, naturally)
            If this bit is not set OR if a letter four letters back (why?) is 'H' then
                Add 'F' to the result
                skip one more character (letter 'H' following the 'G')
    

    Why it is like this is beyond me though, I'm quite sure somebody had a good reason to write it this way, but it seems an obvious bug to me.