pythondictionarysethashsetdawg

Set vs DAWG for checking membership in dictionary in Python


I need to be able to quickly check if a given word is in my dictionary (English wordlist). I'm only concerned with speed of checking membership (not adding or removing elements), and memory use isn't really a problem.

Originally I was using a set like this:

words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
if(word in words):
    ...

My program took approx. 4s to run on a test input. I then tried to optimise things by using a DAWG (http://pypi.python.org/pypi/pyDAWG) instead by precomputing the DAWG and pickling it:

words = pickle.load(open('wordlistDAWG.pyd'))
if(words.word2index(word) is not None):
    ...

On the same test input, the program then took around 40s to run (including a few seconds to load the DAWG which I don't care about). I was hoping using the DAWG would make things run quicker!

Maybe I'm missing some understanding about how python hashes things - is a set already the best I'm going to get (O(1) membership test?) rather than a DAWG or a Trie? Will a DAWG just save on memory but not on computation?

Many Thanks!


Solution

  • I think DAWG doesn't save you CPU cycles if you use it as a set replacement.

    Set lookup is O(1) regarding set size, and DAWG lookup is also O(1) regarding DAWG items count. DAWG lookup is O(N) regarding lookup key length (there are len(key) steps needed to check if key is in DAWG when key is in DAWG). Set lookup is also O(N) regarding key length (because the hash of a key must be calculated). So this boils down to implementation, and

    DAWG may have an advantage when item is not in DAWG because it would require less than len(key) steps to check this, and to calculate hash len(key) steps are always needed (well, if hash value is not cached). But it is hard to beat built-in set even in this case.

    A shameless plug - you can also give https://pypi.python.org/pypi/DAWG a try - but __contains__ is still about 2x slower than dict's.

    By the way, pyDAWG Python version of word2index does many dict lookups internally so it can't be faster than a single set lookup.