Tags: python, dictionary

Searching an alphabetically ordered file in Python is slow


I have 2 text files, both ordered alphabetically.

wordlist.txt, which contains a list of unique words:

(word) 
a
ad
and
at

dictionary.txt, which contains a list of non-unique words, each followed by a tab and a definition:

(word)  (definition)
and congiunzione
at  abbreviazione
at  avverbio

For each word in wordlist.txt I need to traverse dictionary.txt until I find the first match, gather the corresponding definition, and also gather the subsequent ones, if present. Once they are gathered, I break out of the search loop, as it would be useless to traverse dictionary.txt any further.

I then proceed to the next entry in wordlist.txt, and so on.

This is an extract of my code:

for wordtosearch in open("wordlist.txt", "r"):
    wordtosearch = wordtosearch.rstrip("\n")  # strip the trailing newline before comparing
    found = 0
    isfound = False

    for dictionaryentry in open("dictionary.txt", "r"):
        dictionaryelements = dictionaryentry.split("\t")  # split the word and the definition

        if wordtosearch == dictionaryelements[0]:
            # ... here I gather the definition and concatenate it to the previous ones
            found += 1  # at least 1 entry has been found
            isfound = True
        else:
            isfound = False

        # if there is no match in the current iteration but there was at least one
        # match before, we can stop searching any further
        if found > 0 and not isfound:
            break

As you can see, for every wordtosearch I need to traverse the dictionary until the word is found. This takes a lot of time, as both the word list and the dictionary have hundreds of entries, and, although I omitted it above, I actually need to search five different dictionaries.

I thought about saving the line number where the previous word matched, so that the search for the next word could begin from that line of dictionary.txt instead of from the beginning. If no match was found for the previous word, I'd use the line saved for the word before it, and so on.

Would that be a good solution? Or does Python offer something better that I don't know about? By the way, I'm not limited to Python if you know of something better, but I am limited to Windows.
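
For concreteness, here is a minimal sketch of that idea, which amounts to a single-pass merge join over the two sorted files; it assumes both files are sorted under Python's plain string ordering and that every dictionary line contains a tab:

def merge_join(wordlist_path, dictionary_path):
    with open(wordlist_path, encoding="utf-8") as words, \
         open(dictionary_path, encoding="utf-8") as dictionary:
        # (word, definition) pairs in file order; assumes one tab per line
        entries = (line.rstrip("\n").split("\t", 1) for line in dictionary)
        entry = next(entries, None)
        for line in words:
            word = line.rstrip("\n")
            # skip dictionary entries that sort before the current word
            while entry is not None and entry[0] < word:
                entry = next(entries, None)
            # collect every consecutive definition for this word
            definitions = []
            while entry is not None and entry[0] == word:
                definitions.append(entry[1])
                entry = next(entries, None)
            yield word, definitions

for word, definitions in merge_join("wordlist.txt", "dictionary.txt"):
    print(word, definitions)

Each file is read exactly once, so the work is proportional to the combined size of the two files rather than their product.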


Solution

  • Use a relational database already, and do the JOIN you described, perhaps using the SQLite module that is part of Python's standard library. It lets you efficiently process a pair of input files of unlimited size, even on a small-memory VM or desktop (a sketch follows below). If you really want everything RAM resident, there's always .merge().

    Or go old school, and preprocess with /usr/bin/join.

    https://linux.die.net/man/1/join
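
A minimal sketch of the SQLite route, assuming the file layout shown in the question; the table, column, and index names are illustrative:

import sqlite3

con = sqlite3.connect(":memory:")  # use a file path instead to persist the database
con.execute("CREATE TABLE wordlist (word TEXT)")
con.execute("CREATE TABLE dictionary (word TEXT, definition TEXT)")

with open("wordlist.txt", encoding="utf-8") as f:
    con.executemany("INSERT INTO wordlist VALUES (?)",
                    ((line.rstrip("\n"),) for line in f))

with open("dictionary.txt", encoding="utf-8") as f:
    # assumes every line contains a tab between the word and the definition
    con.executemany("INSERT INTO dictionary VALUES (?, ?)",
                    (line.rstrip("\n").split("\t", 1) for line in f))

con.execute("CREATE INDEX dictionary_word ON dictionary (word)")  # makes the join fast

# a single JOIN replaces the nested loops and yields every matching pair
for word, definition in con.execute(
        "SELECT d.word, d.definition FROM wordlist AS w "
        "JOIN dictionary AS d ON d.word = w.word"):
    print(word, definition, sep="\t")

The same database can hold all five dictionaries (one table each, or one table with an extra source column), so each lookup stays a single query.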