php javascript regex character-properties

Regex for names with special characters (Unicode)


Okay, I have read about regex all day now and still don't understand it properly. What I'm trying to do is validate a name, but the functions I can find for this on the internet only use [a-zA-Z], leaving out characters that I need to accept too.

I basically need a regex that checks that the name is at least two words and that it does not contain numbers or special characters like !"#¤%&/()=...; however, the words can contain characters like æ, é, Â and so on.

An example of an accepted name would be: "John Elkjærd" or "André Svenson"
A non-accepted name would be: "Hans", "H4nn3 Andersen" or "Martin Henriksen!"

If it matters, I use the JavaScript .match() function client side and want to use PHP's preg_replace() only "in negative" server side (removing non-matching characters).

Any help would be much appreciated.

Update:
Okay, thanks to Alix Axel's answer I have the important part down, the server-side one.

But as the page from LightWing's answer suggests, I'm unable to find anything about Unicode support for JavaScript, so I ended up with half a solution for the client side, just checking for at least two words and a minimum of 5 characters, like this:

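var minWords = 2; // a name needs at least two words
var words = name.match(/\S+/g) || []; // match() returns null when nothing matches
if (words.length >= minWords && name.length >= 5) {
  // valid
}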

An alternative would be to list all the Unicode characters explicitly, as suggested in shifty's answer. I might end up doing something like that along with the solution above, but it is a bit impractical.
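
For illustration, the explicit-range idea could be sketched roughly as below. The ranges (Basic Latin letters plus U+00C0 to U+024F, skipping × and ÷) are my own assumption and nowhere near a complete list of letters, and NAME_CHARS, NAME_PATTERN and looksLikeName are made-up names. It also assumes accented letters arrive precomposed rather than as a base letter plus a combining mark.

```javascript
// Approximates "letter" with explicit ranges for engines without \p{...}.
// Assumption: Latin-1 Supplement and Latin Extended-A/B cover the names
// we expect; U+00D7 (×) and U+00F7 (÷) are skipped because they are not letters.
var NAME_CHARS = "A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u024F'\\-";

// At least two words separated by whitespace.
var NAME_PATTERN = new RegExp(
  "^[" + NAME_CHARS + "]+(?:\\s+[" + NAME_CHARS + "]+)+$"
);

function looksLikeName(name) {
  return NAME_PATTERN.test(name.trim());
}

console.log(looksLikeName("John Elkjærd"));
console.log(looksLikeName("H4nn3 Andersen"));
```

Names written in other scripts (Greek, Cyrillic and so on) would be rejected by this, which is the impractical part.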


Solution

  • Try the following regular expression:

    ^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$
    

    In PHP this translates to:

    if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0)
    {
        // valid
    }
    

    You should read it like this:

    ^   # start of subject
        (?:     # match this:
            [           # match a:
                \p{L}       # Unicode letter, or
                \p{Mn}      # Unicode non-spacing mark (combining accent), or
                \p{Pd}      # Unicode dash punctuation (hyphens), or
                \'          # single quote, or
                \x{2019}    # right single quotation mark (typographic apostrophe)
            ]+              # one or more times
            \s          # any kind of space
            [           # match a:
                \p{L}       # Unicode letter, or
                \p{Mn}      # Unicode non-spacing mark (combining accent), or
                \p{Pd}      # Unicode dash punctuation (hyphens), or
                \'          # single quote, or
                \x{2019}    # right single quotation mark (typographic apostrophe)
            ]+              # one or more times
            \s?         # any kind of space (0 or 1 times)
        )+      # one or more times
    $   # end of subject
    

    I honestly don't know how to port this to JavaScript; I'm not even sure JavaScript supports Unicode properties. In PHP (PCRE), though, this seems to work flawlessly at IDEOne.com:

    $names = array
    (
        'Alix',
        'André Svenson',
        'H4nn3 Andersen',
        'Hans',
        'John Elkjærd',
        'Kristoffer la Cour',
        'Marco d\'Almeida',
        'Martin Henriksen!',
    );
    
    foreach ($names as $name)
    {
        echo sprintf('%s is %s' . "\n", $name, (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) ? 'valid' : 'invalid');
    }
    

    I'm sorry I can't help you with the JavaScript part, but probably someone here will.
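
For what it's worth, engines that implement ES2018 accept Unicode property escapes such as \p{L} when the u flag is set, so a port might look like the sketch below. WORD, NAME_RE and isValidName are illustrative names, and this assumes a reasonably recent browser or Node.js. In u mode only syntax characters may be backslash-escaped, so the apostrophe is written unescaped.

```javascript
// One "word": letters, combining marks, dashes, or an apostrophe
// (plain ' or the typographic U+2019).
const WORD = "[\\p{L}\\p{Mn}\\p{Pd}'\\u2019]+";

// Same shape as the PHP pattern: word, space, word, optional space, repeated.
const NAME_RE = new RegExp("^(?:" + WORD + "\\s" + WORD + "\\s?)+$", "u");

function isValidName(name) {
  return NAME_RE.test(name);
}

const names = [
  "Alix",
  "André Svenson",
  "H4nn3 Andersen",
  "Hans",
  "John Elkjærd",
  "Kristoffer la Cour",
  "Marco d'Almeida",
  "Martin Henriksen!",
];

for (const name of names) {
  console.log(name + " is " + (isValidName(name) ? "valid" : "invalid"));
}
```

On older engines the constructor throws a SyntaxError for the u flag or for \p{...}, so the word-count check can stay as a fallback.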


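    Validates: André Svenson, John Elkjærd, Kristoffer la Cour, Marco d'Almeida

    Invalidates: Alix, H4nn3 Andersen, Hans, Martin Henriksen!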


    To replace invalid characters, though I'm not sure why you need this, you just need to change it slightly:

    $name = preg_replace('~[^\p{L}\p{Mn}\p{Pd}\'\x{2019}\s]~u', '', $name);
    

    Note that you always need the u modifier, which makes PCRE treat both the pattern and the subject as UTF-8.
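
If the same stripping step is ever wanted client side, a rough JavaScript equivalent could look like this. It assumes an ES2018 engine, and INVALID_CHARS and stripInvalidChars are hypothetical names. It is a sketch of the negated character class above, not a replacement for the server-side check.

```javascript
// Matches every character that is NOT a letter, combining mark, dash,
// apostrophe (plain or U+2019), or whitespace.
const INVALID_CHARS = /[^\p{L}\p{Mn}\p{Pd}'\u2019\s]/gu;

function stripInvalidChars(name) {
  // The g flag makes replace() remove every match, not only the first one.
  return name.replace(INVALID_CHARS, "");
}

console.log(stripInvalidChars("H4nn3 Andersen"));
console.log(stripInvalidChars("Martin Henriksen!"));
```

In JavaScript it is the u flag that enables \p{...}, much like the u modifier does in PHP.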