python data-structures

Optimal implementation of a bijective mapping on a large scale


Here's an interesting problem: given a large amount of text data (~5 GB of words as strings), I need to build a mapping such that every word is associated with a unique integer. It also needs to work the other way: every integer should be associated with a unique word (hence why it's a bijective mapping).

I also need to be able to quickly look up a word by its associated number.

The following is the most naive implementation I can think of:

    def mapping():
        data_structure = []
        for word in giant_list_of_words:
            if word not in data_structure:
                data_structure.append(word)
        return data_structure

    def lookup(data_structure, i):
        return data_structure[i]

With this approach, each word maps to its index in the list. Building the mapping is slow, because the membership test (word not in data_structure) scans the whole list for every word, but lookup by index is fast.

Here's another approach:

    def mapping():
        data_structure = {}
        count = 0
        for word in giant_list_of_words:
            if word not in data_structure:
                data_structure[word] = count
                count += 1
        return data_structure

    def lookup(data_structure, i):
        retval = ''
        for key in data_structure:
            if data_structure[key] == i:
                retval = key
                break
        return retval

This gets built fast, but lookup is slow, since finding the word for a given integer means scanning every key until the matching value turns up. Any thoughts?


Solution

  • I think it's rare that there is an absolutely optimal way to solve a data-structure design problem in Python, but for this question there is a good candidate.

    Every distinct object in Python, including strings, has a number, id(obj), which is unique and never changes for the lifetime of the object.

    It happens that the _ctypes module has a function named PyObj_FromPtr which looks up an object by its id:

    >>> word = 'supercalifragilisticexpialadocious'
    >>> word_id = id(word)
    >>> word_id
    139817888649440
    >>> from _ctypes import PyObj_FromPtr
    >>> PyObj_FromPtr(word_id)
    'supercalifragilisticexpialadocious'
    

    This is all built into the language - Python assigns these ids to your objects whether you need them or not, and the lookup is fast because (as a CPython implementation detail) the id of an object is its memory address. So it's hard to imagine there is any more efficient solution to this problem.
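
    As a rough illustration of how the pieces could fit together, here is a minimal sketch. The helper names build_mapping, to_int and to_word are mine, not anything from the standard library, and it assumes the set of distinct words is kept alive for as long as the integers are in use, since an id is only valid while its object exists; PyObj_FromPtr is a private, CPython-only helper.

        # A minimal sketch of the idea above, assuming the words set is kept
        # alive for as long as the integers are used. PyObj_FromPtr lives in
        # the private _ctypes module and is a CPython implementation detail.
        from _ctypes import PyObj_FromPtr

        def build_mapping(giant_list_of_words):
            # One pass: keep exactly one reference to each distinct word so
            # its id cannot be recycled for some other object.
            return set(giant_list_of_words)

        def to_int(word_obj):
            # The integer for a word is just its id (its address in CPython).
            return id(word_obj)

        def to_word(i):
            # Recover the word object living at that address.
            return PyObj_FromPtr(i)

        words = build_mapping(['cat', 'dog', 'cat'])
        numbers = [to_int(w) for w in words]
        assert [to_word(n) for n in numbers] == list(words)

    One caveat worth stating: id() identifies a particular object rather than an equal value, so the word-to-integer direction has to be done with the stored word objects themselves (or through an ordinary dict that canonicalises equal strings to the stored object).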