
How to check if a string is a valid Python identifier, including a keyword check?


Is there any built-in Python method that will check whether something is a valid Python variable name, INCLUDING a check against reserved keywords? (That is, something like 'in' or 'for' would fail.)

Failing that, where can I get a list of reserved keywords (i.e., dynamically, from within Python, as opposed to copying and pasting something from the online docs)? Or is there a good way of writing your own check?

Surprisingly, testing by wrapping a setattr in try/except doesn't work, as something like this:

setattr(myObj, 'My Sweet Name!', 23)

...actually works! (...and can even be retrieved with getattr!)
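
A slightly fuller demonstration of the behaviour described above, as a minimal sketch (Thing is just a placeholder class invented for this example):

```python
class Thing(object):
    """Placeholder class used only for this demonstration."""


obj = Thing()

# No exception is raised, even though the name is not a valid identifier.
setattr(obj, 'My Sweet Name!', 23)

# The value can be read back with getattr and shows up in the instance dict.
value = getattr(obj, 'My Sweet Name!')
print(value)      # 23
print(vars(obj))  # {'My Sweet Name!': 23}
```

So a try/except around setattr cannot serve as a validity check, because attribute names are stored as plain dictionary keys.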


Solution

    Python 3

    Python 3 has str.isidentifier() (for example, 'foo'.isidentifier()), which seems to be the best solution for recent Python versions (thanks to fellow runciter@freenode for the suggestion). However, somewhat counter-intuitively, it does not check against the list of keywords, so it must be combined with keyword.iskeyword():

    import keyword
    
    def isidentifier(ident: str) -> bool:
        """Determines if string is valid Python identifier."""
    
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))
    
        if not ident.isidentifier():
            return False
    
        if keyword.iskeyword(ident):
            return False
    
        return True
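
As for getting the list of reserved words at runtime, the same keyword module exposes it as keyword.kwlist. A short sketch, assuming Python 3 (the variable name reserved is only illustrative):

```python
import keyword

# kwlist reflects the keywords of the interpreter that is actually running,
# so nothing has to be copied from the online docs.
reserved = set(keyword.kwlist)

print(sorted(reserved))
print('for' in reserved, keyword.iskeyword('for'))
print('foo' in reserved, keyword.iskeyword('foo'))
```

The exact contents of kwlist differ between Python versions, which is the main reason to query it dynamically rather than hard-code it.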
    

    Python 2

    For Python 2, the easiest way to check whether a given string is a valid Python identifier is to let Python parse it itself.

    There are two possible approaches. The fastest is to use ast and check whether the AST of a single expression has the desired shape:

    import ast
    
    def isidentifier(ident):
        """Determines if string is valid Python identifier."""
    
        # Smoke test — if it's not string, then it's not identifier, but we don't
        # want to just silence exception. It's better to fail fast.
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))
    
        # Resulting AST of simple identifier is <Module [<Expr <Name "foo">>]>
        try:
            root = ast.parse(ident)
        except SyntaxError:
            return False
    
        if not isinstance(root, ast.Module):
            return False
    
        if len(root.body) != 1:
            return False
    
        if not isinstance(root.body[0], ast.Expr):
            return False
    
        if not isinstance(root.body[0].value, ast.Name):
            return False
    
        if root.body[0].value.id != ident:
            return False
    
        return True
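
To see why the shape checks above are needed, it may help to look at how a few inputs parse. This is only an illustrative sketch, assuming Python 3; top_level_shape is a hypothetical helper, not part of the ast module:

```python
import ast


def top_level_shape(source):
    """Describe the top-level statements that `source` parses into."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "SyntaxError"

    parts = []
    for stmt in tree.body:
        if isinstance(stmt, ast.Expr):
            # Expression statement: also report the kind of expression.
            parts.append("Expr(" + type(stmt.value).__name__ + ")")
        else:
            parts.append(type(stmt).__name__)
    return ", ".join(parts)


for text in ["foo", "pass", "foo.bar", "foo; bar", "foo bar"]:
    print(repr(text), "->", top_level_shape(text))
```

Only the first input yields a single Expr wrapping a Name; keywords become other statement types, attribute access becomes a different expression node, and anything with several statements fails the length check.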
    

    The other is to let the tokenize module split the string into a stream of tokens and check that it contains only our name:

    import keyword
    import tokenize
    
    def isidentifier(ident):
        """Determines if string is valid Python identifier."""
    
        # Smoke test - if it's not string, then it's not identifier, but we don't
        # want to just silence exception. It's better to fail fast.
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))
    
        # Quick test - if string is in keyword list, it's definitely not an ident.
        if keyword.iskeyword(ident):
            return False
    
        readline = lambda g=(lambda: (yield ident))(): next(g)
        tokens = list(tokenize.generate_tokens(readline))
    
        # You should get exactly 2 tokens
        if len(tokens) != 2:
            return False
    
        # First is NAME, identifier.
        if tokens[0][0] != tokenize.NAME:
            return False
    
        # Name should span all the string, so there would be no whitespace.
        if ident != tokens[0][1]:
            return False
    
        # Second is ENDMARKER, ending stream
        if tokens[1][0] != tokenize.ENDMARKER:
            return False
    
        return True
    

    The same function, but compatible with Python 3, looks like this:

    import keyword
    import tokenize
    
    def isidentifier_py3(ident):
        """Determines if string is valid Python identifier."""
    
        # Smoke test — if it's not string, then it's not identifier, but we don't
        # want to just silence exception. It's better to fail fast.
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))
    
        # Quick test — if string is in keyword list, it's definitely not an ident.
        if keyword.iskeyword(ident):
            return False
    
        readline = lambda g=(lambda: (yield ident.encode('utf-8-sig')))(): next(g)
        tokens = list(tokenize.tokenize(readline))
    
        # You should get exactly 3 tokens
        if len(tokens) != 3:
            return False
    
        # In Python 3 the first token is ENCODING; it is always utf-8 because
        # we explicitly prepended a UTF-8 BOM to ident.
        if tokens[0].type != tokenize.ENCODING:
            return False
    
        # Second is NAME, identifier.
        if tokens[1].type != tokenize.NAME:
            return False
    
        # Name should span all the string, so there would be no whitespace.
        if ident != tokens[1].string:
            return False
    
        # Third is ENDMARKER, ending stream
        if tokens[2].type != tokenize.ENDMARKER:
            return False
    
        return True
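
One caveat with counting tokens exactly: some Python 3 releases also emit an implicit NEWLINE token for input that lacks a trailing newline, which changes the count. A variant that ignores such bookkeeping tokens could look like the sketch below. It assumes Python 3, feeds the string through io.StringIO instead of a generator-based readline, and the name isidentifier_tokens is hypothetical:

```python
import io
import keyword
import tokenize

# Tokens that carry no content of their own and may or may not be present.
_BOOKKEEPING = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}


def isidentifier_tokens(ident):
    """Token-based identifier check that does not rely on an exact count."""
    if not isinstance(ident, str):
        raise TypeError("expected str, but got {!r}".format(type(ident)))

    if keyword.iskeyword(ident):
        return False

    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(ident).readline))
    except (tokenize.TokenError, SyntaxError):
        # Some versions raise on input the tokenizer cannot handle.
        return False

    significant = [tok for tok in tokens if tok.type not in _BOOKKEEPING]

    # Exactly one meaningful token, it is a NAME, and it spans the whole string.
    return (
        len(significant) == 1
        and significant[0].type == tokenize.NAME
        and significant[0].string == ident
    )
```

Because it still goes through tokenize, this sketch shares any tokenizer limitations with the versions above.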
    

    However, be aware of bugs in the Python 3 tokenize implementation that cause it to reject some completely valid identifiers, such as ℘᧚ and 贈ᩭ. ast works fine, though. Generally, I'd advise against using the tokenize-based implementation for actual checks.

    Also, some may consider heavy machinery like an AST parser to be a tad overkill. This simple implementation is self-contained and guaranteed to work on any Python 2:

    import keyword
    import string
    
    def isidentifier(ident):
        """Determines if string is valid Python identifier."""
    
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))
    
        if not ident:
            return False
    
        if keyword.iskeyword(ident):
            return False
    
        # ascii_letters is used rather than lowercase/uppercase, which are
        # locale-dependent in Python 2.
        first = '_' + string.ascii_letters
        if ident[0] not in first:
            return False
    
        other = first + string.digits
        for ch in ident[1:]:
            if ch not in other:
                return False
    
        return True
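
The same ASCII-only rule can also be written as a regular expression. This is a different technique from the loop above, shown only as a sketch; the name isidentifier_regex is hypothetical, and like the loop it deliberately rejects the Unicode identifiers that Python 3 allows:

```python
import keyword
import re

# First character: ASCII letter or underscore.
# Remaining characters: ASCII letters, digits or underscores.
# \Z anchors at the very end, so a trailing newline is not accepted.
_ASCII_IDENT = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')


def isidentifier_regex(ident):
    """ASCII-only identifier check based on a regular expression."""
    if not isinstance(ident, str):
        raise TypeError("expected str, but got {!r}".format(type(ident)))

    if keyword.iskeyword(ident):
        return False

    return _ASCII_IDENT.match(ident) is not None
```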
    

    Here are a few tests to check that these all work:

    assert isidentifier('foo')
    assert isidentifier('foo1_23')
    assert not isidentifier('pass')    # syntactically correct keyword
    assert not isidentifier('foo ')    # trailing whitespace
    assert not isidentifier(' foo')    # leading whitespace
    assert not isidentifier('1234')    # number
    assert not isidentifier('1234abc') # number and letters
    assert not isidentifier('👻')      # Unicode not from allowed range
    assert not isidentifier('')        # empty string
    assert not isidentifier('   ')     # whitespace only
    assert not isidentifier('foo bar') # several tokens
    assert not isidentifier('no-dashed-names-for-you') # no such thing in Python
    
    # Unicode identifiers are only allowed in Python 3:
    assert isidentifier('℘᧚') # Unicode $Other_ID_Start and $Other_ID_Continue
    

    Performance

    All measurements were taken on my machine (MBPr Mid 2014) using the same randomly generated test set of 1 500 000 elements: 1 000 000 valid and 500 000 invalid. YMMV.

    == Python 3:
    method | calls/sec | faster
    ---------------------------
    token  |    48 286 |  1.00x
    ast    |   175 530 |  3.64x
    native | 1 924 680 | 39.86x
    
    == Python 2:
    method | calls/sec | faster
    ---------------------------
    token  |    83 994 |  1.00x
    ast    |   208 206 |  2.48x
    simple | 1 066 461 | 12.70x
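
The benchmark script itself is not shown here. A minimal sketch of how a calls/sec figure could be measured, assuming Python 3; make_samples, calls_per_second, the sample alphabet and the sample size are all hypothetical choices, not the setup that produced the numbers in the tables:

```python
import keyword
import random
import string
import time


def native(ident):
    """The str.isidentifier() plus keyword check from the Python 3 section."""
    return ident.isidentifier() and not keyword.iskeyword(ident)


def make_samples(count, seed=0):
    """Generate random short strings, some valid identifiers and some not."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + "_ -"
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 12)))
        for _ in range(count)
    ]


def calls_per_second(func, samples):
    """Time one pass over the samples and convert it to calls per second."""
    start = time.perf_counter()
    for sample in samples:
        func(sample)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(samples) / max(elapsed, 1e-9)


samples = make_samples(10000)
print("native: {:,.0f} calls/sec".format(calls_per_second(native, samples)))
```

Any of the other implementations can be passed to calls_per_second in the same way to compare them on identical input.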