Python string interning


While this question doesn't have any real use in practice, I am curious as to how Python does string interning. I have noticed the following.

>>> "string" is "string"
True

This is as I expected.

You can also do this.

>>> "strin"+"g" is "string"
True

And that's pretty clever!

But you can't do this.

>>> s1 = "strin"
>>> s2 = "string"
>>> s1+"g" is s2
False

Why wouldn't Python evaluate s1+"g", and realize it is the same as s2 and point it to the same address? What is actually going on in that last block to have it return False?


Solution

  • This is implementation-specific, but your interpreter is probably interning compile-time constants but not the results of run-time expressions.

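    The examples below use CPython 3.9. The exact bytecode differs between versions (for instance, CPython 3.11 and later emit BINARY_OP instead of BINARY_ADD), but the behaviour described is the same.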

    In the second example, the expression "strin"+"g" is evaluated at compile time, and is replaced with "string". This makes the first two examples behave the same.
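One way to see the folding directly is to compile the expression without running it and look at the constants stored on the resulting code object. This is only a sketch that assumes a CPython interpreter; the names `source`, `code` and `folded` are illustrative.

```python
# Compile the expression without running it and inspect the code object's
# constants. Constant folding is a CPython implementation detail, so other
# interpreters may store different constants.
source = 's2 = "strin" + "g"'
code = compile(source, "<example>", "exec")

# On CPython the concatenation is already done by the compiler, so the
# joined string is stored as a single constant.
folded = "string" in code.co_consts
print(code.co_consts)
print(folded)
```

If `folded` is true, the concatenation never happens at run time: the interpreter only loads the precomputed constant.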

    If we examine the bytecodes, we'll see that they are exactly the same:

      # s1 = "string"
      1           0 LOAD_CONST               0 ('string')
                  2 STORE_NAME               0 (s1)
    
      # s2 = "strin" + "g"
      2           4 LOAD_CONST               0 ('string')
                  6 STORE_NAME               1 (s2)
    

    This bytecode was obtained with the following snippet, which prints a few more lines after those shown above:

    import dis
    
    source = 's1 = "string"\ns2 = "strin" + "g"'
    code = compile(source, '', 'exec')
    dis.dis(code)  # dis.dis() prints the disassembly itself and returns None
    

    The third example involves a run-time concatenation, the result of which is not automatically interned:

      # s3a = "strin"
      3           8 LOAD_CONST               1 ('strin')
                 10 STORE_NAME               2 (s3a)
    
      # s3 = s3a + "g"
      4          12 LOAD_NAME                2 (s3a)
                 14 LOAD_CONST               2 ('g')
                 16 BINARY_ADD
                 18 STORE_NAME               3 (s3)
                 20 LOAD_CONST               3 (None)
                 22 RETURN_VALUE
    

    This bytecode was obtained with the following snippet. It first prints the same lines as the first block of bytecode above, followed by the lines shown here:

    import dis
    
    source = (
        's1 = "string"\n'
        's2 = "strin" + "g"\n'
        's3a = "strin"\n'
        's3 = s3a + "g"')
    code = compile(source, '', 'exec')
    dis.dis(code)  # dis.dis() prints the disassembly itself and returns None
    
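The difference can also be checked programmatically rather than by reading the disassembly. The sketch below assumes CPython and uses illustrative names (`opnames`, `has_runtime_add`, `precomputed`). It looks for a run-time addition opcode and confirms that no precomputed "string" constant exists.

```python
import dis

source = 's3a = "strin"\ns3 = s3a + "g"'
code = compile(source, "<example>", "exec")

# Collect the opcode names the compiler emitted for this source.
opnames = [instr.opname for instr in dis.get_instructions(code)]

# The addition is BINARY_ADD on CPython 3.10 and earlier,
# and BINARY_OP on CPython 3.11 and later.
has_runtime_add = any(name in ("BINARY_ADD", "BINARY_OP") for name in opnames)

# Because one operand is a name, the compiler cannot fold the expression,
# so the joined string is not among the constants.
precomputed = "string" in code.co_consts
print(has_runtime_add, precomputed)
```

Since the concatenation is left for run time, its result is a fresh string object that has not been through the interning step applied to compile-time constants.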

    If you were to manually sys.intern() the result of the third expression, you'd get the same object as before:

    >>> import sys
    >>> s3a = "strin"
    >>> s3 = s3a + "g"
    >>> s3 is "string"
    False
    >>> sys.intern(s3) is "string"
    True
    

    Also, Python 3.8 and later print a warning for the last two statements above, because they use is to compare against a string literal:

    SyntaxWarning: "is" with a literal. Did you mean "=="?
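The same experiment can be run without triggering that warning by comparing names instead of literals. This is a minimal sketch with illustrative variable names. The result of the plain identity check is a CPython implementation detail, so only the comparison through sys.intern() should be relied upon.

```python
import sys

prefix = "strin"
literal = "string"
runtime = prefix + "g"  # built at run time, so not automatically interned

print(runtime == literal)   # equal by value
print(runtime is literal)   # implementation detail; do not rely on it
print(sys.intern(runtime) is sys.intern(literal))  # same object once both are interned
```

In ordinary code, == is the right way to compare strings. Identity only matters when you deliberately intern strings, for example to speed up repeated dictionary lookups.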