Tags: python, algorithm, search, a-star, discrete-optimization

Spelling Bee optimal word set search


The NYT Spelling Bee is a word game: given 7 letters, one of which is the "center letter", you find words at least 4 letters long that must use the center letter. The scoring is as follows: 4-letter words score 1 point, longer words score their length in points, and pangrams (words using all 7 letters) get a +7 point bonus.

Often when I play, I wonder which set of 7 letters would generate the maximum score. This reddit post takes a greedy backwards stepwise approach: starting from the set of all 26 letters, it repeatedly removes the single letter whose removal loses the least score, rescoring the subset at each step. I suspect the end result is near-optimal, if not optimal, but I wanted to know if there is a way to find the provably optimal solution.

My idea was to modify the method to use uniform-cost search (essentially Dijkstra's algorithm): start at the node representing the full letter set and explore neighbors formed by removing one letter. A max-priority queue always expands the frontier node with the highest score, and the search ends at the first expanded node whose letter set has exactly 7 letters.

The issue with this approach is that many intermediate nodes have the same score, so it ends up searching a huge fraction of the 2^26 possible nodes anyway. I don't know of an admissible heuristic for A* that would help.

One trick I came up with to test whether a word fits a given letter set is to preprocess each word into a letter bitmask, where the i-th bit is set if the word contains the i-th letter of the alphabet. Then word | mask == mask holds exactly when the word uses only letters in the mask.
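For example, here is a minimal, self-contained illustration of the test (the two words are arbitrary):

def to_bitset(word):
    bits = 0
    for c in word:
        bits |= 1 << (ord(c) - ord('a'))
    return bits

mask = to_bitset("abelnos")  # letter set {a, b, e, l, n, o, s}
print(to_bitset("seasonable") | mask == mask)  # True: uses only those letters
print(to_bitset("reasonable") | mask == mask)  # False: contains 'r'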

I wrote two sample implementations, in Python/PyPy and C++, but neither finished within an hour. So how can this be solved efficiently?

The sample implementation below doesn't consider pangrams or the center letter (which may cut down on some searching); a sketch of that fuller scoring follows the code.

The word list can be generated with grep -P "^[a-z]{4,}$" /usr/share/dict/words > words.txt.

from string import ascii_lowercase

S = ascii_lowercase  # alphabet Σ

with open("words.txt") as f:
    L0 = f.read().splitlines()  # language

# filter out words with more than 7 distinct letters
L = list(filter(lambda w: len(set(w)) <= 7, L0))

print(len(L0), len(L))
print(L[:20])

# for fun, make words into bitsets of letters: every word is a 26-bit integer
# the game letterset is a 26-bit mask 
bitsets = [0] * len(L)
for i in range(len(L)):
    for c in L[i]:
        bitsets[i] |= 1 << (ord(c) - ord('a'))
    
    if i < 20:
        print(L[i].ljust(16), bin(bitsets[i])[2:].zfill(26))

# simple counting for now
def word_score(word):
    if len(word) == 4:
        return 1
    else:
        return len(word)

def mask_score(mask):
    s = 0
    for i in range(len(L)):
        if mask | bitsets[i] == mask:  # word i uses only letters in mask
            s += word_score(L[i])
    return -s  # negated so the min-heap acts as a max-priority queue


from heapq import heappop, heappush

# Graph search with Dijkstra's algorithm (but modified to calculate node weights)
start_mask = (1 << len(S)) - 1 
start_node = (mask_score(start_mask), start_mask)

pq = [start_node]
visited = set()

while True:
    score, mask = heappop(pq)

    print(score, "".join(S[i] if mask & (1 << i) else ' ' for i in range(26)))

    if mask.bit_count() == 7:  # int.bit_count() requires Python 3.10+
        break

    # Add unvisited neighbors to queue
    for i in range(len(S)):
        if mask & (1 << i):
            new_mask = mask & ~(1 << i)
            if new_mask in visited:
                continue
            new_score = mask_score(new_mask)
            new_node = (new_score, new_mask)
            heappush(pq, new_node)
            visited.add(new_mask)
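
For reference, here is a hedged sketch (not part of the timed code above) of how the fuller game scoring might be layered on top of the same preprocessing, reusing L, bitsets, and word_score:

def full_mask_score(mask, center_bit):
    # Sketch: a word only counts if it contains the center letter,
    # and a pangram (a word using all 7 letters) earns +7 on top.
    s = 0
    for word, bits in zip(L, bitsets):
        if bits | mask == mask and bits & center_bit:
            s += word_score(word)
            if bits == mask:  # pangram: the word uses every letter in the set
                s += 7
    return s

# e.g. the letter set {a, b, e, l, n, o, s} with 's' as the center letter
# (to_bitset is from the example above)
print(full_mask_score(to_bitset("abelnos"), 1 << (ord('s') - ord('a'))))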

Solution

  • Optimise your wordlist

    Your mask_score function is doing a lot of redundant work. In particular, each word's score does not need to be recomputed on every call to mask_score; it can instead be stored alongside the word. Additionally, instead of looping over every word, you can loop over the unique letter bitmasks, which cuts the wordlist L roughly in half (using the wamerican wordlist I get 20818 unique bitmasks from 40940 words). This combines naturally with the precomputed scoring: for each unique bitmask we store the sum of the scores of all words sharing that bitmask.

    For example, consider abalones and seasonable. Currently, mask_score loops over both words, calls word_score on each, and adds up the totals. We can instead store a dictionary of (bitmask, score) pairs, so that a single entry (288787, 18) accounts for both abalones (8 points) and seasonable (10 points); 288787 is the bitmask for the letter set {a, b, e, l, n, o, s}.

    With these optimisations, it becomes feasible to loop through each of the 26 choose 7 = 657800 possible sets of starting letters and find the one with the maximal score. On my machine, the following Python implementation of mask_score runs 1000 times in about 1.4 seconds.

    from string import ascii_lowercase
    
    S = ascii_lowercase  # alphabet Σ
    
    with open("words.txt") as f:
        L0 = f.read().splitlines()  # language
    
    # filter out words with more than 7 distinct letters
    L = list(filter(lambda w: len(set(w)) <= 7, L0))
    
    # for fun, make words into bitsets of letters: every word is a 26-bit integer
    # the game letterset is a 26-bit mask
    bitsets = [0] * len(L)
    for i in range(len(L)):
        for c in L[i]:
            bitsets[i] |= 1 << (ord(c) - ord('a'))
    
    # simple counting for now
    def word_score(word):
        if len(word) == 4:
            return 1
        else:
            return len(word)
    
    # L1 is a mapping of bitmask -> sum of scores of words with bitmask
    L1 = dict()
    for i in range(len(L)):
        L1.setdefault(bitsets[i], 0)
        L1[bitsets[i]] += word_score(L[i])
    
    print(f"{len(L1)=}, {len(L)=}, {len(L0)=}")
    
    def mask_score(mask):
        s = 0
        for bits, score in L1.items():  # unique word bitmask -> summed score
            if mask | bits == mask:
                s += score
        return -s
    
    import timeit
    print(timeit.timeit("mask_score(0b11111111111111111111111111)", number=1000, globals=globals()))
    # 1.376941168000485
    

    With PyPy or a C++ implementation, the computation would likely be faster still.
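
    For completeness, here is one way the exhaustive loop might look (a sketch building on the snippet above; itertools.combinations enumerates the 7-letter subsets):

    from itertools import combinations

    best = (-1, 0)
    for letters in combinations(range(26), 7):
        mask = 0
        for i in letters:
            mask |= 1 << i
        # mask_score returns a negated score, so flip the sign back
        best = max(best, (-mask_score(mask), mask))

    score, mask = best
    print(score, "".join(S[i] for i in range(26) if mask & (1 << i)))

    At roughly 1.4 ms per mask_score call, the full loop over all 657800 sets runs on the order of 15 minutes in CPython.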