[SOLVED] Regex matching Unicode variable names

In Python 2, a variable name may contain only ASCII letters, digits and underscores, and it must not start with a digit. Thus,

 re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)

will find a matching Python name in the str s.
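As a quick illustration, here is a minimal sketch of that ASCII-only pattern in use. The sample string and the names ASCII_NAME and first_name are made up for the example.

```python
import re

# ASCII-only identifier pattern from above
ASCII_NAME = re.compile(r'[_a-zA-Z][_a-zA-Z0-9]*')

s = '42 + my_var1'  # hypothetical sample input
match = ASCII_NAME.search(s)
first_name = match.group(0) if match else None
print(first_name)  # my_var1
```

Note that re.search scans the whole string, so the leading 42 is skipped rather than rejected.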

In Python 3, the letters are no longer restricted to ASCII. I am looking for a new regex that matches every legal Python 3 variable name.

According to the docs, \w in a regex will match any Unicode word character, including digits and the underscore. I am however unsure whether this character set contains exactly those characters which may be used in variable names.

Even if the character set \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" which start with a number, which is no good. I have the following solution in mind,

re.search(r'(\w&[^0-9])\w*', s)

where & is the "and" operator (just like | is the "or" operator). The parenthesized group would thus match any word character that is not also a digit. The problem with this is that the & operator does not exist, and so I'm stuck with no solution.
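To make the failure concrete, here is a small sketch of what that pattern actually does in Python's re module, where & is an ordinary character. The sample strings and variable names are invented for illustration.

```python
import re

# '&' has no special meaning in Python's re syntax, so this pattern means:
# one word character, a literal '&', one non-digit, then more word characters.
attempt = re.compile(r'(\w&[^0-9])\w*')

plain_name = attempt.search('my_var')      # no '&' in the input
with_ampersand = attempt.search('a&b12')   # contains a literal '&'

print(plain_name)                # None
print(with_ampersand.group(0))   # a&b12
```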

Edit

Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.

Solution

You can use a double negative: \W matches anything that \w does not, so a negated character class containing \W matches exactly the \w characters, and you can add further exclusions to it:

[^\W0-9]\w*

In other words: one character that is neither a non-word character nor one of 0-9, followed by any number of word characters.
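A minimal sketch of the pattern in use, cross-checked against str.isidentifier(), which is Python's own test for a syntactically valid name. The sample strings and the names NAME, samples, found and odd are assumptions made for the example, not part of the original answer.

```python
import re

# Double-negative pattern from the answer above
NAME = re.compile(r'[^\W0-9]\w*')

samples = ['3rd_größe', 'переменная = 1', '_x2']  # hypothetical inputs
found = [NAME.search(s).group(0) for s in samples]
print(found)  # ['rd_größe', 'переменная', '_x2']

# Every match here is also accepted by Python itself.
print(all(name.isidentifier() for name in found))  # True

# Caveat: \w also matches non-ASCII digits, which the 0-9 exclusion misses.
odd = '\u0663x'  # ARABIC-INDIC DIGIT THREE followed by 'x'
print(bool(NAME.fullmatch(odd)), odd.isidentifier())  # True False
```

The last line suggests that \w is not exactly the set of legal identifier characters, so if an exact answer matters, validating a candidate with str.isidentifier() is probably the safer check.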

Docs: regular-expression-syntax