Tags: python, python-3.x, dictionary, string-interning, memory-optimization

Is sys.intern() used for every look-up, or only when a string is created the first time? (Python Follow-up)


This is a follow-up to my previous question regarding string interning in Python, though I think it is unrelated enough to qualify as a separate question. In short, when using sys.intern, do I need to pass the string in question to the function upon most/every use, or do I only have to intern the string once and track its reference? To clarify with a pseudo-codeish use case where I do what I think is correct: (see comments)

import sys

# stores all words in sequence,
# we want duplicate words too,
# but those should refer to the same string
# (the reason we want interning)
word_sequence = []
# simple word count dictionary
word_dictionary = {}
for line in text:
    for word in line.split():  # stand-in for the real parsing/tokenizing logic
        # returns a canonical "reference"
        word_i = sys.intern(word)
        word_sequence.append(word_i)
        try:
            # do not need to intern again for
            # specific use as dictionary key,
            # or is something undesirable done
            # by the dictionary that would require
            # another call here?
            word_dictionary[word_i] += 1 
        except KeyError:
            word_dictionary[word_i] = 1

# ...somewhere else in a function far away...
# Let's say that we want to use the word sequence list to
# access the dictionary (even the duplicates):
for word in word_sequence:
    # Do NOT need to re-sys.intern() word
    # because it is the same string object
    # interned previously?
    count = word_dictionary[word]
    print(count)

What if I want to access words in a different dictionary? Do I need to use sys.intern() again when inserting a key:value, even if the key has already been interned? May I have some clarification? Thank you in advance.


Solution

  • You have to use sys.intern() each time you have a new string object, otherwise you can't guarantee that you have the same object for the value represented.

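    However, your word_sequence list contains references to interned string objects. You don't have to use sys.intern() again on those. At no point is anything creating a copy of a string here (which would be unnecessary and wasteful). Using one of those objects as a new key in another dictionary likewise stores a reference to it rather than a copy.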

    All sys.intern() does is map the string value to a specific object that has that value. As long as you then keep a reference to the return value, you are guaranteed to still have access to that one specific object.
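
The points above can be checked with a small sketch. The sample strings and the names a, b, c, counts, and other are made up for illustration and are not from the question. The strings are built at runtime on the assumption that this yields separate objects, as it normally does in CPython, whereas string literals may already be shared by the compiler.

```python
import sys

# Three equal strings, each built separately at runtime.
a = "".join(["inter", "ning", " demo"])
b = "".join(["inter", "ning", " demo"])
c = "".join(["inter", "ning", " demo"])

# Interning maps every equal value to one canonical object.
a_i = sys.intern(a)
b_i = sys.intern(b)
same_after_intern = a_i is b_i

# Containers store references, so no re-interning is needed
# when the interned object is passed around.
word_sequence = [a_i, b_i]
counts = {a_i: 2}
other = {}
other[word_sequence[1]] = "some value"

key_in_counts = next(iter(counts))
key_in_other = next(iter(other))
keys_shared = key_in_counts is key_in_other is word_sequence[0]

# A new string object still finds the entry (lookup uses equality),
# but it only becomes the canonical object after sys.intern().
lookup_works = counts[c] == 2
canonical = sys.intern(c)
canonical_is_shared = canonical is a_i

print(same_after_intern, keys_shared, lookup_works, canonical_is_shared)
```

Lookups never require interning, since dictionaries compare keys by hash and equality. Interning only matters when you want every stored reference to point at a single object, which is why it is needed once per newly created string and not once per use.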