pythonlistsetmembershipanagram

Python- Possible English one-word anagrams given input letters


I know variations of this have been asked before, but I was unable to understand any of the previous implementations because most of them involved using sets and the issubset method.

Here is what I am trying to do: I have a set of words in a dictionary and a list of possible letters. I want to find if the members of the set can be formed through re-arranging the letters in the list. Here is my current implementation:

def solve(dictionary, letters):

    for word in dictionary: #for each word in the dictionary
        if len(word) > len(letters):   # first optimization, doesn't check words that are larger than letter set
            continue
        else: 
            scrambledword = "".join([b for b in sorted(list(word))]) #sorts the letters in each word
            if set(scrambledword).issubset(letters):
                print word


def main():
    dictionary = set([x.strip() for x in open("C:\\Python27\\dictionary.txt")])        
    letters = sorted(['v','r','o','o','m','a','b','c','d'])
    solve(dictionary, letters)


main()

The obvious problem with this implementation is that some words will be found that use more than one letter in "letters." For example, the word 'cardboard' appears as a valid word, despite there being only one copy of 'a' and 'r' in the letters list. How do I use the "issubset" method on lists?


Solution

  • To know if you can make a word out of a set of letters [oops, I did it myself -- I meant 'collection'!], you want every letter to occur at least the right number of times, so I think we're going to have to work the counts in there somehow. By definition, Python sets don't care about the number of elements in a source list. Maybe something like

    from collections import Counter
    
    letters = ['v','r','o','o','m','a','b','c','d']
    words = 'cardboard boom booom'.split()
    letterscount = Counter(letters)
    
    for word in words:
        wordcount = Counter(word)
        print word, all(letterscount[c] >= wordcount[c] for c in wordcount)
    

    giving

    cardboard False
    boom True
    booom False
    

    Counter is a handy utility class:

    >>> c = Counter(letters)
    >>> c
    Counter({'o': 2, 'a': 1, 'c': 1, 'b': 1, 'd': 1, 'm': 1, 'r': 1, 'v': 1})
    >>> c['o']
    2
    >>> c['z']
    0
    

    [DSM: the return! I removed a community edit which doesn't work because Counter instances aren't hashable.]

    If searching speed is a concern, then you can trade off memory and precomputation time:

    from collections import defaultdict, Counter
    from itertools import combinations
    
    # precomputations
    allwords = open('/usr/share/dict/words').read().split() 
    allwords = list(w for w in allwords if len(w) >= 3) # hack, /words contains lots of silliness
    allwords_by_count = defaultdict(list)
    for i, word in enumerate(allwords):
        allwords_by_count[frozenset(word)].append((word, Counter(word)))
        if i % 1000 == 0:
            print i, word
    
    
    def wordsfrom(letters, words_by_count):
        lettercount = Counter(letters)
        for subsetsize in range(1, len(lettercount)+1):
            for subset in combinations(lettercount, subsetsize):
                for possword, posswordcount in words_by_count[frozenset(subset)]:
                    if all(posswordcount[c] <= lettercount[c] for c in posswordcount):
                        yield possword
    
    >>> wordsfrom('thistles', allwords_by_count)
    <generator object wordsfrom at 0x1032956e0>
    >>> list(wordsfrom('thistles', allwords_by_count))
    ['ess', 'sis', 'tit', 'tst', 'hei', 'hie', 'lei', 'lie', 'sie', 'sise', 'tie', 'tite', 'she', 'het', 'teth', 'the', 'els', 'less', 'elt', 'let', 'telt', 'set', 'sett', 'stet', 'test', 'his', 'hiss', 'shi', 'sish', 'hit', 'lis', 'liss', 'sil', 'lit', 'til', 'tilt', 'ist', 'its', 'sist', 'sit', 'shies', 'tithe', 'isle', 'sile', 'sisel', 'lite', 'teil', 'teli', 'tile', 'title', 'seit', 'sesti', 'site', 'stite', 'testis', 'hest', 'seth', 'lest', 'selt', 'lish', 'slish', 'hilt', 'lith', 'tilth', 'hist', 'sith', 'stith', 'this', 'list', 'silt', 'slit', 'stilt', 'liesh', 'shiel', 'lithe', 'shiest', 'sithe', 'theist', 'thesis', 'islet', 'istle', 'sistle', 'slite', 'stile', 'stilet', 'hitless', 'tehsil', 'thistle']
    

    [Heh. I just noticed that 'thistles' itself isn't in the list, but that's because it's not in the words file..]

    And yes, the apparent "nonwords" there are really in the words file:

    >>> assert all(w in allwords for w in (wordsfrom('thistles', allwords_by_count)))
    >>>