Here's an interesting problem: given a large amount of text data (~5 GB of words as strings), I need to build a mapping such that every word is associated with a unique integer. It should be noted that it needs to work the other way - every integer should also be associated with a unique word (hence why its a bijective mapping).
I also need to be able to quickly look up a word by its associated number.
The following is the most naive implementation I can think of:
data_structure = []
for word in giant_list_of_words:
if (word not in data_structure):
data_structure.append(word)
return data_structure
def lookup(data_structure, i):
return data_structure[i]
With this approach, the mapping is simply words to their index in the list. Building the mapping is slow, but lookup is fast.
Here's another approach:
def mapping():
data_structure = {}
count = 0
for word in giant_list_of_words:
if (word not in data_structure):
data_structure[word] = count
count += 1
return data_structure
def lookup(data_structure, i):
retval = ''
for key in data_structure:
if (data_structure[key] == i):
retval = key
break
return retval
This gets built fast, but is slow to index. Any thoughts?
I think it's rare that there is an absolutely optimal way to solve a data-structure design problem in Python, but for this question there is a good candidate.
Each and every distinct object in Python, including strings, has a number id(obj)
which is unique, and never changes for the lifetime of the object.
It happens that the _ctypes
module has a function named PyObj_FromPtr
which looks up an object by its id
:
>>> word = 'supercalifragilisticexpialadocious'
>>> word_id = id(word)
>>> word_id
139817888649440
>>> from _ctypes import PyObj_FromPtr
>>> PyObj_FromPtr(word_id)
'supercalifragilisticexpialadocious'
This is all built into the language - Python assigns these ids to your objects whether you need them or not, and the lookup is fast because (as a CPython implementation detail) the id of an object is its memory address. So it's hard to imagine there is any more efficient solution to this problem.