
How can I create an alphanumeric Regex for all languages?


I had this problem today:

This regex matches only English letters and digits: [a-zA-Z0-9].

If I need to support any language in the world, what regex should I write?


Solution

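  • With a Unicode-aware regex engine, the character class shorthands do this for you. The \w class matches "word characters": letters and digits from any script, plus the underscore. Because the underscore is not alphanumeric, exclude it (for example with [^\W_]) if you need a strictly alphanumeric match.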

    Beware that some regex flavors handle this inconsistently: JavaScript uses ASCII for \d (digits) and \w, but Unicode for \s (whitespace). XML does it the other way around: Unicode for \d and \w, but ASCII for \s.
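
    To make the difference concrete, here is a minimal sketch in Python 3, chosen only for illustration since the question is language-agnostic. It assumes Python's built-in re module, where \w is Unicode-aware for str patterns by default and the re.ASCII flag restricts it to ASCII (roughly the JavaScript behaviour described above). The names ASCII_ONLY, UNICODE_WORD, ASCII_WORD, and is_alnum_any_language are made up for this example.

    ```python
    import re

    # The pattern from the question: ASCII letters and digits only.
    ASCII_ONLY = re.compile(r"[a-zA-Z0-9]+")

    # In Python 3, \w on str patterns is Unicode-aware by default.
    UNICODE_WORD = re.compile(r"\w+")

    # re.ASCII restricts \w to [a-zA-Z0-9_], similar to an ASCII-only flavor.
    ASCII_WORD = re.compile(r"\w+", re.ASCII)


    def is_alnum_any_language(text):
        """Return True if text is non-empty and made only of letters/digits.

        [^\\W_] means "a word character that is not the underscore",
        because \\w on its own also accepts "_".
        """
        return re.fullmatch(r"[^\W_]+", text) is not None


    if __name__ == "__main__":
        for sample in ["Hello123", "Привет123", "日本語", "snake_case"]:
            print(sample, is_alnum_any_language(sample))
    ```

    How well this works depends on the engine's Unicode support, so check your own flavor's documentation before relying on \w for validation.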