php regex aspell

Regexp and pspell_check with UTF-8 (umlauts)


I'm having trouble with this piece of code. It should take a string, split it into words, and check each word against a dictionary. However, when the string contains an umlaut (ÄäÖöÜü), the word gets split at that character.

I'm pretty sure the problem is the character class [A-ZäöüÄÖÜ\']. It seems I'm including the special characters incorrectly, but how should I do it?

$string = "Rechtschreibprüfung";      
preg_match_all("/[A-ZäöüÄÖÜ\']{1,16}/i", $string, $words);
for ($i = 0; $i < count($words[0]); ++$i) {
    if (!pspell_check($pspell_link, $words[0][$i])) {
        $array[] = $words[0][$i];            
    }
}

Result:

$array[0] = "Rechtschreibprü"
$array[1] = "fung"

Solution

  • To match a chunk of Unicode letters, you can use:

    '/\p{L}+/u'
    

    The \p{L} matches any Unicode letter, + matches one or more occurrences of the preceding subpattern, and the /u modifier makes PCRE treat both the pattern and the subject string as UTF-8.

    To only match whole words, use word boundaries:

    '/\b\p{L}+\b/u'
    

    If the words may contain combining diacritical marks (for example, a decomposed ü stored as u followed by U+0308), also add \p{M}:

    '/\b[\p{M}\p{L}]+\b/u'
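
Putting this together with the loop from the question, a minimal sketch could look like the following. The helper names extract_words() and find_misspelled() are made up for illustration, and the spell checker is passed in as a callable so the splitting logic can be tried without pspell installed. The sketch uses the letter-and-mark class without \b, since a greedy + already consumes each run of letters in full.

```php
<?php
// Hypothetical helper: return all runs of Unicode letters (and combining
// marks) found in a UTF-8 string.
function extract_words(string $text): array
{
    // \p{L} = any letter, \p{M} = combining mark.
    // The /u modifier makes PCRE read pattern and subject as UTF-8,
    // so "ü" is handled as one character instead of two bytes.
    preg_match_all('/[\p{L}\p{M}]+/u', $text, $matches);
    return $matches[0];
}

// Hypothetical helper: collect every word the checker rejects.
// $isCorrect is any callable that takes a word and returns true/false.
function find_misspelled(string $text, callable $isCorrect): array
{
    $misspelled = [];
    foreach (extract_words($text) as $word) {
        if (!$isCorrect($word)) {
            $misspelled[] = $word;
        }
    }
    return $misspelled;
}
```

With pspell, the checker could be function ($word) use ($pspell_link) { return pspell_check($pspell_link, $word); }, where $pspell_link is the dictionary handle from the question. This assumes the dictionary was opened with UTF-8 as its encoding; otherwise the words would probably need converting before the check.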