Tags: javascript, regex, string, replace, unicode-escapes

JavaScript regex validation for non-Latin characters with a small symbol whitelist


I'm trying to create validation rules for a username in two steps:

  1. Detect whether the string contains any non-Latin characters. All non-alphabetic symbols, numbers, and whitespace are allowed.
  2. Detect whether the string contains any symbols that are not in the whitelist (' - _ `). All Latin and non-Latin characters, numbers, and whitespace are allowed.

I thought it would be easy, but I was wrong...

  1. For the first case, I tried removing Latin characters, numbers, and whitespace from the string:

str.replace(/[A-Za-z0-9\s]/g, '')

With this rule, "Xxx z 88A ююю 4$??!!" becomes "ююю$??!!". But how do I also remove all the symbols, so that only "ююю" remains?

  2. For the second case, I tried removing Latin characters, numbers, whitespace, and the whitelisted symbols (' - _ `) with str.replace(/[A-Za-z0-9-_`\s]/g, ''), but I don't know how to remove non-Latin characters.

Summary: My main problem is detecting non-Latin characters and separating them from special symbols.

UPDATE: Ok, for my second case I can use:

str.replace(/[\u0250-\ue007]/g, '').replace(/[A-Za-z0-9-_`\s]/g, '')

It works, but it looks dirty... Sorry about the backticks.


Solution

  • For the first problem (eliminating a-z, 0-9, whitespace, symbols and punctuation), you need to know a couple of Unicode tricks.

    1. You can reference Unicode character categories with the \p{...} property escape. Symbols are S, punctuation is P.

    2. To use this, you need to add the u flag to the regex.

    That gives us:

    /([a-z0-9]|\s|\p{S}|\p{P})/giu
    

    (I added the i flag so that I don't have to write A-Z as well as a-z.)

    Since you have a solution for your second problem, I'll leave that with you.
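
To make both steps concrete, here is a minimal sketch of how they might look as helper functions. The first uses the regex above unchanged. The second is only one possible tidier alternative to the chained replace from the question's update, assuming that "characters" means Unicode letters (\p{L}) and "numbers" means Unicode numbers (\p{N}). The function names are hypothetical, chosen just for illustration.

```javascript
// Step 1: strip Latin letters, digits, whitespace, symbols and punctuation.
// Whatever survives is something else, e.g. non-Latin letters.
const STRIP_LATIN_AND_SYMBOLS = /([a-z0-9]|\s|\p{S}|\p{P})/giu;

// Hypothetical helper: returns what is left after stripping.
function nonLatinRemainder(str) {
  return str.replace(STRIP_LATIN_AND_SYMBOLS, '');
}

// Hypothetical helper: true if the string contains non-Latin characters.
function hasNonLatin(str) {
  return nonLatinRemainder(str).length > 0;
}

// Step 2 (assumption, not from the answer above): match anything that is
// NOT a letter, a number, whitespace, or one of the whitelisted ' - _ `
const DISALLOWED_SYMBOLS = /[^\p{L}\p{N}\s'\-_`]/gu;

// Hypothetical helper: returns an array of the offending characters.
function findDisallowedSymbols(str) {
  return str.match(DISALLOWED_SYMBOLS) || [];
}

const sample = 'Xxx z 88A ююю 4$??!!';
console.log(nonLatinRemainder(sample));               // "ююю"
console.log(hasNonLatin(sample));                     // true
console.log(findDisallowedSymbols(sample).join(''));  // "$??!!"
console.log(findDisallowedSymbols("O'Neil-ююю_1`"));  // []
```

Neither pattern treats combining marks (category M) as letters, so a name typed in decomposed form would leave stray marks behind in step 1 and be flagged in step 2. If that matters, adding \p{M} to the relevant sets should cover it.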