pythonregexcharacter-properties

Match any unicode letter?


In .net you can use \p{L} to match any letter, how can I do the same in Python? Namely, I want to match any uppercase, lowercase, and accented letters.


Solution

  • Python's re module doesn't support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too.

    Since \w will also match digits, you need to then subtract those from your character class, along with the underscore:

    [^\W\d_]
    

    will match any Unicode letter.

    >>> import re
    >>> r = re.compile(r'[^\W\d_]', re.U)
    >>> r.match('x')
    <_sre.SRE_Match object at 0x0000000001DBCF38>
    >>> r.match(u'é')
    <_sre.SRE_Match object at 0x0000000002253030>