php regex email idn

preg_match verification of non-English email addresses (internationalized domain names)


We all know email address verification is a touchy subject; there are so many opinions on the best way to deal with it short of encoding the entire RFC. But since 2009 it's become even more difficult, and I haven't really seen anyone address the issue of IDNs yet.

Here is what I've been using:

preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}\z/i', $email)

Which will work for most email addresses, but what if I need to match a non-Latin email address, e.g. bob@china.中國 or bob@russia.рф?

Look here for the complete list. (Notice all the non-Latin domain extensions at the bottom of the list.)

Information on this subject can be found here, and I think what they are saying is that these new extensions will simply be read as '.xn--fiqz9s' and '.xn--p1ai' at the machine level, but I'm not 100% sure.
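As a rough illustration of that conversion, and assuming PHP's intl extension is installed, idn_to_ascii() can show the ASCII ("Punycode") form of a domain. The helper name to_ascii_domain() is hypothetical and only wraps the call so the sketch degrades gracefully when intl is missing.

```php
<?php
// Sketch only: to_ascii_domain() is a hypothetical helper, not part of the question.
// idn_to_ascii() comes from PHP's intl extension; without it we return null.
function to_ascii_domain(string $domain): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null; // intl extension not loaded
    }
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $ascii;
}

// With intl loaded, these should print the xn-- forms quoted above.
foreach (['china.中國', 'russia.рф'] as $domain) {
    $ascii = to_ascii_domain($domain);
    echo $domain, ' => ', $ascii ?? '(intl extension unavailable)', PHP_EOL;
}
```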

If so, does that mean the only change I need to consider making in my code is the following? (It would also cover long domain extensions like .travelersinsurance and .sandvikcoromant.)

preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,20}\z/i', $email)
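A quick way to probe that question is to run the widened pattern against a long Latin extension and against the xn-- forms mentioned earlier. This is only a sketch; the sample addresses are made up for illustration. Because the extension part is limited to [a-z], the xn-- forms, which contain hyphens and digits, should not match.

```php
<?php
// The widened pattern from the question: letters only in the extension, 2 to 20 of them.
$pattern = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,20}\z/i';

// Illustrative sample addresses (assumptions, not taken from real data).
$samples = [
    'bob@example.travelersinsurance', // long Latin extension
    'bob@russia.xn--p1ai',            // ASCII form of .рф
    'bob@china.xn--fiqz9s',           // ASCII form of .中國
];

$results = [];
foreach ($samples as $email) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $results[$email] = preg_match($pattern, $email) === 1;
    echo $email, ': ', $results[$email] ? 'match' : 'no match', PHP_EOL;
}
```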

NOTICE: This is not related to the discussion found on the page "Using a regular expression to validate an email address".


Solution

  • Here is what I eventually came up with.

    preg_match('/^[\pL\pM\pN._%+-]+@[\pL\pM\pN.-]+\.[\pL\pM]{2,20}\z/u', $email)
    

    This uses Unicode character properties like \pL, \pM and \pN to help me deal with letters and numbers from any language.

    \pL Any kind of letter from any language, upper or lower case.

    \pM Any combining mark: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.). Outside a character class this is usually written \pM*+ to match zero or more marks, but inside square brackets * and + are literal characters, so the quantifier is left out here.

    \pN Any kind of numeric character.

    The expression above matches normal email addresses like me@mydomain.com as well as cacophonous email addresses like a.s中3_yÄhমহাজোটেরoo文%网+d-fελληνικά@πyÄhooαράδειγμα.δοκιμή.

    It's not that I don't trust people to type in their own email addresses, but people do make mistakes, and I may use this code in other situations. For example, I may need to double-check the integrity of an existing list of 10,000 email addresses. Besides, I was always taught NOT to trust user input and to ALWAYS filter.

    UPDATE

    I just discovered that, although this works perfectly when tested on sites like phpliveregex.com and locally when parsing a normal UTF-8 string, it doesn't work properly with email fields, because browsers convert the content of fields of that type to plain Latin characters. So an email address like bob@china.中國 or bob@russia.рф gets converted to bob@china.xn--fiqz9s or bob@russia.xn--p1ai before being received by the server. The only thing really missing from my original filter was support for hyphens and digits in the domain extension.

    Here is the final version:

    preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i', $email);
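For the "existing list" scenario, where some entries may be in ASCII form and others raw UTF-8, one possible approach is to try both patterns. This is a minimal sketch under a few assumptions: looks_like_email() is a hypothetical helper name, the sample list is invented, and the hyphen is placed last in each character class so it cannot be read as a range (a sequence such as +-. inside brackets would otherwise also admit a comma). The Unicode pattern uses \pM without the *+ quantifier, since inside brackets those two characters would be literals.

```php
<?php
// Hypothetical helper: accepts either the ASCII form (as submitted by browser
// email fields) or the raw Unicode form (as might appear in a stored list).
function looks_like_email(string $email): bool
{
    // ASCII form; the extension may contain digits and hyphens (xn--...).
    $ascii = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i';
    // Unicode form; letters, combining marks and numbers from any script.
    $unicode = '/^[\pL\pM\pN._%+-]+@[\pL\pM\pN.-]+\.[\pL\pM]{2,20}\z/u';

    // preg_match() returns false on invalid UTF-8 with /u, so compare against 1.
    return preg_match($ascii, $email) === 1
        || preg_match($unicode, $email) === 1;
}

// Illustrative list (assumed data, not from the original post).
$list = ['me@mydomain.com', 'bob@russia.xn--p1ai', 'bob@russia.рф', 'bob@china.中國', 'not an address'];
foreach ($list as $email) {
    echo $email, ': ', looks_like_email($email) ? 'ok' : 'rejected', PHP_EOL;
}
```

Trying the ASCII pattern first keeps the common case cheap; the Unicode pattern only matters for strings that never passed through a browser email field.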