regex c# globalization

Regular expression for validating names and surnames?


Although this seems like a trivial question, I am quite sure it is not :)

I need to validate names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames from which I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If it were only English names, I think this would cut it:

^[a-z '-]+$

However, I also need to support these cases:

Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?

I am looking for something similar to the many "email address" regexes that you can find on Google.


Solution

  • I'll try to give a proper answer myself:

    The only punctuation marks that should be allowed in a name are the full stop, the apostrophe and the hyphen. I haven't seen any other in the list of corner cases.

    Regarding numbers, there's only one case with an 8. I think I can safely disallow that.

    Regarding letters, any letter is valid.

    I also want to include space.

    This would sum up to this regex:

    ^[\p{L} \.'\-]+$
    

    This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.

    So the validation code should be something like this (untested):

    var name = nameParam.Trim();
    // Verbatim string (@"...") so that \p{L} reaches the regex engine unescaped
    if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$")) 
        throw new ArgumentException("nameParam");
    name = name.Replace("'", "&#39;");  //&apos; does not work in IE
    

    Can anyone think of a reason why a name should not pass this test, or of an XSS or SQL injection that could pass?
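
    On the encoding step, here is a minimal sketch of one alternative: instead of replacing the apostrophe by hand, validate the raw name and HTML-encode it only when it is written into a page. `System.Net.WebUtility.HtmlEncode` is a framework method; the `NameInput` class and its method names are hypothetical, made up for illustration. This only covers HTML output. SQL injection is normally dealt with separately, through parameterized queries.

    ```csharp
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    // Hypothetical helper for illustration, not part of the original answer.
    public static class NameInput
    {
        // Letters, combining marks, space, full stop, apostrophe and hyphen.
        private static readonly Regex Allowed = new Regex(@"^[\p{L}\p{M} .'\-]+$");

        // True when the trimmed name uses only the allowed characters.
        public static bool IsValid(string raw)
        {
            return raw != null && Allowed.IsMatch(raw.Trim());
        }

        // Encodes the trimmed name for HTML output; the apostrophe becomes &#39;.
        public static string ForHtml(string raw)
        {
            return WebUtility.HtmlEncode(raw.Trim());
        }
    }

    public static class Program
    {
        public static void Main()
        {
            foreach (var candidate in new[] { "D'Addario", "<xss>" })
            {
                Console.WriteLine(NameInput.IsValid(candidate)
                    ? NameInput.ForHtml(candidate)
                    : candidate + " -> rejected");
            }
        }
    }
    ```

    Keeping the stored value unencoded and encoding at output time means the same name can be reused in other contexts (plain-text email, CSV export) without having to decode it first.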


    Complete tested solution:

    using System;
    using System.Text.RegularExpressions;
    
    namespace test
    {
        class MainClass
        {
            public static void Main(string[] args)
            {
                var names = new string[]{"Hello World", 
                    "John",
                    "João",
                    "タロウ",
                    "やまだ",
                    "山田",
                    "先生",
                    "мыхаыл",
                    "Θεοκλεια",
                    "आकाङ्क्षा",
                    "علاء الدين",
                    "אַבְרָהָם",
                    "മലയാളം",
                    "상",
                    "D'Addario",
                    "John-Doe",
                    "P.A.M.",
                    "' --",
                    "<xss>",
                    "\""
                };
                foreach (var nameParam in names)
                {
                    Console.Write(nameParam+" ");
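                    var name = nameParam.Trim();
                    // \p{L} = any letter, \p{M} = combining marks (accents, vowel signs)
                    if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                    {
                        Console.WriteLine("fail");
                        continue;
                    }
                    // Encode the apostrophe for HTML output
                    name = name.Replace("'", "&#39;");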
                    Console.WriteLine(name);
                }
            }
        }
    }
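
    The tested pattern adds `\p{M}` to the character class, which the first draft did not have. The answer does not say why, so the following is my reading: several of the test names contain combining marks (vowel signs, viramas, accents stored as separate code points), and those are not in the `\p{L}` category. A small sketch comparing the two patterns; the `MarkDemo` class and its member names are hypothetical.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical demo class for illustration.
    public static class MarkDemo
    {
        // Character class from the first draft: letters only.
        public const string LettersOnly = @"^[\p{L} .'\-]+$";

        // Character class from the tested program: letters plus combining marks.
        public const string LettersAndMarks = @"^[\p{L}\p{M} .'\-]+$";

        // The Hindi test name "आकाङ्क्षा", written with escapes so the code points
        // are visible: U+093E is a vowel sign and U+094D a virama, both marks.
        public const string Hindi =
            "\u0906\u0915\u093E\u0919\u094D\u0915\u094D\u0937\u093E";

        // "José" with the accent stored as a separate combining code point (U+0301).
        public const string Decomposed = "Jose\u0301";

        public static void Main()
        {
            foreach (var name in new[] { Hindi, Decomposed, "John" })
            {
                Console.WriteLine("{0}: letters only = {1}, letters and marks = {2}",
                    name,
                    Regex.IsMatch(name, LettersOnly),
                    Regex.IsMatch(name, LettersAndMarks));
            }
        }
    }
    ```

    The letters-only pattern rejects the first two names and accepts only "John", while the pattern with `\p{M}` accepts all three.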