Tags: php, regex, unicode, collation, diacritics

PHP regex: match similar letters, e.g. u=ü or ê=é=è=e


I'm working on a way to search for specific words in a text and highlight them. The code works perfectly, except that I would like it to also match similar letters. For example, searching for fête should also match fêté, fete, ...

Is there an easy & elegant way to do this?

This is my current code:

$regex='/(' . preg_replace('/\s+/', '|', preg_quote($usersearchstring)) .')/iu';

$higlightedtext = preg_replace($regex, '<span class="marked-search-text">\0</span>', $text);

My text is not HTML-encoded, and searching in MariaDB already matches the similar variants.

[edit] And here a longer example of the issue:

$usersearchstring='fête';
$text='la paix fêtée avec plus de 40 cultures';
$regex='/(' . preg_replace('/\s+/', '|', preg_quote($usersearchstring)) .')/iu';
$higlightedtext = preg_replace($regex, '<span class="marked-search-text">\0</span>', $text);

The result is that $higlightedtext is identical to $text.

When changing $usersearchstring to the word "fêté", $higlightedtext is

'la paix <span class="marked-search-text">fêté</span>e avec plus de 40 cultures'

However, I want it to always match all the variations of the letters, since many variations of the words are possible (and do occur in reality). We have fête, fêté and possibly even fete in the database.

I have been thinking about this, but the only solution I see is to have a huge array with all letter replacement options, then loop over them and try every variation. That is not elegant and will be slow. (For many letters I have at least 5 variations, e.g. aáàâä, so a word with 3 vowels would need 125 (5x5x5) preg_replace calls.)

[/edit]


Solution

  • Your question is about collation, the art of handling natural-language text to order and compare it using knowledge about languages' lexical rules. You're looking for case-insensitive and diacritical-mark-insensitive collation.

    A common collation rule is B comes after A. A less common rule, but important to your question, is ê and e are equivalent. Collations contain lots of rules like these, worked out carefully over years. If you're using case-insensitive collation, you want rules like a and A are equivalent.

    A diacritical rule that's true in most European languages, but not Spanish, is this: Ñ and N are equivalent. In Spanish, Ñ comes after N.

    Modern databases know about these collations. If you use MySQL, for example, you can set up a column with a character set of utf8mb4 and a collation of utf8mb4_unicode_ci. This does a good job with most languages (though it is not perfect for Spanish).

    Regex technology is not very useful for collation work. If you use regex for this you're trying to reinvent the wheel, and you're likely to reinvent the flat tire instead.

    PHP, like most modern programming languages, has collation support. It lives in the Collator class of the intl extension. Here's a simple example of using a Collator object for your accented-character use case. It uses the Collator::PRIMARY collation strength to perform a case- and accent-insensitive comparison.

    mb_internal_encoding("UTF-8");
    $collator  = collator_create('fr_FR');
    $collator->setStrength(Collator::PRIMARY);
    $str1 = mb_convert_encoding('fêté', 'UTF-8');
    $str2 = mb_convert_encoding('fete', 'UTF-8');
    $result = $collator->compare($str1, $str2);
    echo $result;
    

    The $result here is zero, meaning the strings are equal. That's what you want.
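
    To make the role of the strength setting more concrete, here is a small sketch that compares the same strings at different strengths, and compares n with ñ under a French and a Spanish locale. It assumes the intl extension is loaded and that the source file is saved as UTF-8; the variable names are only illustrative.

    ```php
    <?php
    // Sketch only: requires the intl extension; strings are assumed to be UTF-8.
    $fr = new Collator('fr_FR');

    // PRIMARY strength: only base letters matter, accents and case are ignored.
    $fr->setStrength(Collator::PRIMARY);
    $primaryAccent = $fr->compare('fêté', 'fete');   // 0: treated as equal
    $primaryCase   = $fr->compare('FETE', 'fete');   // 0: treated as equal

    // TERTIARY strength (the ICU default): accents and case are significant.
    $fr->setStrength(Collator::TERTIARY);
    $tertiaryEqual = ($fr->compare('fêté', 'fete') === 0);   // false

    // The locale matters too: under the Spanish rules, ñ is expected to be
    // a separate letter even at PRIMARY strength, while French folds it to n.
    $fr->setStrength(Collator::PRIMARY);
    $frenchEqual = ($fr->compare('n', 'ñ') === 0);

    $es = new Collator('es_ES');
    $es->setStrength(Collator::PRIMARY);
    $spanishEqual = ($es->compare('n', 'ñ') === 0);

    var_dump($primaryAccent, $primaryCase, $tertiaryEqual, $frenchEqual, $spanishEqual);
    ```

    So the choice of locale and strength together decides which letters count as "the same" for your search.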

    If you want to search for matching substrings within a string this way you need to do so with explicit substring matching. Regex technology doesn't provide that.

    Here's a function to do the search and annotation (adding of <span> tags, for example). It takes full advantage of the Collator class's schemes for character equality.

    function annotate_ci ($haystack, $needle, $prefix, $suffix, $locale="fr_FR") {
    
        $restoreEncoding = mb_internal_encoding();
        mb_internal_encoding("UTF-8");
        $len = mb_strlen($needle);
        if ( mb_strlen( $haystack ) < $len ) {
            mb_internal_encoding($restoreEncoding);
            return $haystack;
        }
        $collator = collator_create( $locale );
        $collator->setStrength( Collator::PRIMARY );
    
        $result = "";
        $remain = $haystack;
        while ( mb_strlen( $remain ) >= $len ) {
            $matchStr = mb_substr($remain, 0, $len);
            $match = $collator->compare( $needle, $matchStr );
            if ( $match == 0 ) {
                /* add the matched $needle string to the result, with annotations.
                 * take the matched string from $remain
                 */
                $result .= $prefix . $matchStr . $suffix;
                $remain = mb_substr( $remain, $len );
            } else {
                /* add one char to $result, take one from $remain */
                $result .= mb_substr( $remain, 0, 1 );
                $remain = mb_substr( $remain, 1 );
            }
        }
        $result .= $remain;
        mb_internal_encoding($restoreEncoding);
        return $result;
    }
    

    And here's an example of the use of that function.

    $needle = 'Fete';  /* no diacriticals here! mixed case! */
    $haystack= mb_convert_encoding('la paix fêtée avec plus de 40 cultures', 'UTF-8');
    
    $result = annotate_ci($haystack, $needle, 
                          '<span class="marked-search-text">' , '</span>');
    

    It gives back

     la paix <span class="marked-search-text">fêté</span>e avec plus de 40 cultures
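
    The code in the question splits the search string on whitespace and highlights any of the resulting words. A possible way to get the same behaviour with the Collator approach is sketched below. The function name annotate_any_ci is hypothetical (it is not part of the answer above). It scans the haystack once and tries every needle at each position, so the inserted tags are never re-scanned. It assumes PHP 7.4 or later, the intl and mbstring extensions, and UTF-8 input in composed (NFC) form.

    ```php
    <?php
    // Hypothetical helper, a sketch only: annotate every occurrence of any
    // of several needles in a single pass over the haystack.
    function annotate_any_ci(string $haystack, array $needles, string $prefix,
                             string $suffix, string $locale = 'fr_FR'): string
    {
        $collator = new Collator($locale);
        $collator->setStrength(Collator::PRIMARY);   // ignore case and accents

        // Try longer needles first, so the longest match wins at a position.
        usort($needles, fn($a, $b) => mb_strlen($b, 'UTF-8') <=> mb_strlen($a, 'UTF-8'));

        $chars  = mb_str_split($haystack, 1, 'UTF-8');   // one entry per code point
        $count  = count($chars);
        $result = '';
        $i = 0;
        while ($i < $count) {
            $matched = false;
            foreach ($needles as $needle) {
                $len = mb_strlen($needle, 'UTF-8');
                if ($len === 0 || $i + $len > $count) {
                    continue;   // empty needle, or not enough text left
                }
                $candidate = implode('', array_slice($chars, $i, $len));
                if ($collator->compare($needle, $candidate) === 0) {
                    // Keep the haystack's own spelling inside the annotation.
                    $result .= $prefix . $candidate . $suffix;
                    $i += $len;
                    $matched = true;
                    break;
                }
            }
            if (!$matched) {
                $result .= $chars[$i];   // copy one character unchanged
                $i++;
            }
        }
        return $result;
    }

    // Usage: split the user's search string on whitespace, as in the question.
    $needles = preg_split('/\s+/u', 'fete culture', -1, PREG_SPLIT_NO_EMPTY);
    $out = annotate_any_ci('la paix fêtée avec plus de 40 cultures', $needles, '<b>', '</b>');
    echo $out, "\n";   // la paix <b>fêté</b>e avec plus de 40 <b>culture</b>s
    ```

    Like annotate_ci, this compares windows with the same number of characters as the needle, so it will miss matches whose accented letters are stored in decomposed form. Normalizing both strings to NFC first (for example with the intl Normalizer class) is one way to reduce that risk.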