Tags: python, loops, optimization, execution-time

How can I optimize checking permutations of a 10-letter string against a list of words?


I am trying to make a game based on the TV show Cifras y letras in Spain (Countdown in the UK, Des chiffres et des lettres in France). I have a list of all the Spanish words thanks to code I found that let me scrape an official site. With that, I built a dict of all the letters and how common they are, in order to generate a random (weighted) 10-character string from which the player tries to make the longest possible word. The problem arrives when I get to calculating the longest possible word for that 10-letter string.

Here letters (random_string at the call site) is the 10-character string I generate, and word_list (filtered_words) is the list of all Spanish words with 10 letters or fewer.

from itertools import permutations

def find_longest_word(letters, word_list):
    # Sort word_list by length (longest to shortest)
    word_list.sort(key=len, reverse=True)
    
    # Convert word_list to a set for fast lookup
    word_set = set(word_list)
    
    longest_word = ''
    max_length = 0
    
    # Iterate over lengths from len(letters) down to 1
    for length in range(len(letters), 0, -1):
        # Filter word_list to include only words of the current length
        filtered_words = [word for word in word_list if len(word) == length]
        # permutations() given the letter set and a length yields all permutations
        for perm in permutations(letters, length):
            # The outer loop makes us start with 10-letter words; after exhausting
            # all permutations of that length we move down to 9, and so on
            perm_word = ''.join(perm)
            if perm_word in filtered_words:
                if length > max_length:
                    max_length = length
                    longest_word = perm_word
                    # Exit the loop once the longest word is found
                    return longest_word
    
    return longest_word

longest_word = find_longest_word(random_string, filtered_words)
longest_word = find_longest_word(random_string, filtered_words)

This has been executing for an hour and is still going. I understand that, the way I am doing things, there are 10! (≈3.6M) permutations of length 10 alone to check against roughly 81,000 Spanish words (somewhat fewer in practice, since I compare only against words of matching length, though I'm not sure that helps much, because the code still has to scan every word to check its length).
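For reference, the total number of candidate strings can be computed with the standard library (`math.perm` and `math.comb` need Python 3.8+); this is a quick sanity check of the scale involved, not part of the game code:

```python
import math

# Worst case: every permutation of every length from 10 down to 1
total_perms = sum(math.perm(10, k) for k in range(10, 0, -1))
print(total_perms)  # 9864100 candidate strings (before deduplication)

# By contrast, the number of letter subsets (combinations) is tiny
total_combs = sum(math.comb(10, k) for k in range(10, 0, -1))
print(total_combs)  # 1023
```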

https://github.com/Albertofma/Letras is the whole project in case anyone wants to see the rest of the code.


Solution

  • A couple of things algorithm-wise can help you:

    If you stick with the code you have, you should at least make filtered_words a set. You are doing up to 3.6M lookups into a list of roughly 80K elements, and that is going to be slow. And if no 10-letter word matches, you then do another 3.6M-plus lookups for P(10, 9), and so on.

    Consider doing "the work" on the words beforehand (as shown below). If you sort the letters of each of the ~80K testable words (which is extremely quick for short strings), then you can avoid permuting the sample letters entirely and you only have a couple hundred (at most) comparisons to do, because you can test combinations of letters, sorted. And sum(C(10,10) = 1, C(10,9) = 10, C(10,8) = 45, ...) = 1023 is tiny compared with the sum of the permutation counts.
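    A minimal sketch of that one-line change (toy word list standing in for the real one, names as in the question):

```python
# Toy stand-in for the ~80K-word Spanish list
filtered_words = ["perro", "gato", "casa", "amor"]

# Build the set once; every membership test then averages O(1)
# instead of scanning the whole list.
word_set = set(filtered_words)

print("gato" in word_set)   # True
print("tigre" in word_set)  # False
```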
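    The trick rests on the fact that two words are anagrams exactly when their sorted letters are equal, so the sorted string can serve as a dictionary key. A toy illustration:

```python
def signature(word):
    # Sorting the letters collapses every anagram to one canonical key
    return ''.join(sorted(word))

print(signature("roma"))                       # 'amor'
print(signature("amor") == signature("roma"))  # True: same multiset of letters
print(signature("amor") == signature("mora"))  # True as well
```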

    This mod of your code runs in less than a second:

    import random  # To build the random string given to the player
    
    from collections import defaultdict
    from itertools import combinations  # Combinations (not permutations) of the 10-letter set
    
    # Define the file path
    file_path = 'palabras.txt'
    
    # Dictionary mapping accented letters to the unaccented counterparts that replace them
    accented_to_unaccented = {
        'é': 'e', 'í': 'i', 'ú': 'u', 'á': 'a', 'ó': 'o', 'ü': 'u', 'è': 'e'
    }
    
    # Map from sorted-letter signature -> list of words with exactly those letters
    filtered_words = defaultdict(list)
    
    # Function to replace accented letters with their unaccented counterparts
    def replace_accented_letters(word):
        return ''.join(accented_to_unaccented.get(char, char) for char in word)
    
    # Open and read the file
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read each line (word) in the file
        for line in file:
            # Strip leading/trailing whitespace and force lowercase *before*
            # the length check -- the raw line still carries its newline,
            # so checking len(line) would wrongly skip 10-letter words
            word = line.strip().lower()
            if not word or len(word) > 10:
                continue
            word = replace_accented_letters(word)  # Replace accents
    
            # Key each word by its sorted letters so all anagrams share one entry
            filtered_words[''.join(sorted(word))].append(word)
    
    
    def find_longest_word(letters, word_list: dict):
    
        # Iterate over lengths from len(letters) down to 1
        for length in range(len(letters), 0, -1):
            # Every combination of `length` letters, sorted, is a candidate key
            for sample in combinations(letters, length):
                sorted_letters = ''.join(sorted(sample))
                matches = word_list.get(sorted_letters)
                if matches:
                    return matches
    
        return None
    
    random_string = ''.join(random.sample('abcdefghijklmnopqrstuvwxyz', 10))
    # Use the longest-word function
    longest_word = find_longest_word(random_string, filtered_words)
    
    print(f"La palabra mas larga de las letras random '{random_string}' posible era:  {longest_word}")