python duplicates word-list

Deleting duplicate words in a very large word list


I'm a beginner at this, and I wrote a program that generates a wordlist following a specific pattern. The problem is that it produces duplicates.

So I'm looking for a way to make the code generate the requested number of words without duplicating any of them.

Alternatively, a second program could go through the word list the first program produced and delete any duplicated words from that file. That will take time, but it is worth it.

The generated words should look like X4K7GB9y: 8 characters long, following the pattern [A-Z][0-9][A-Z][0-9][A-Z][A-Z][0-9][a-z]. This is the code:

import random
import string

random.seed(0)
NUM_WORDS = 100000000

with open("wordlist.txt", "w", encoding="utf-8") as ofile:     
    for _ in range(NUM_WORDS): 
        uppc = random.sample(string.ascii_uppercase, k=4)
        lowc = random.sample(string.ascii_lowercase, k=1) 
        digi = random.sample(string.digits, k=3) 
        word = uppc[0] + digi[0] + uppc[1] + digi[1] + uppc[2] + uppc[3] + digi[2] + lowc[0] 
        print(word, file=ofile)

I'd appreciate it if you could modify the code so that it does not produce duplicates, or write another program that checks the wordlist for duplicates and deletes them. Thank you so much in advance.


Solution

  • You can prevent duplicate words from the get-go by remembering what you have already created and not writing it again.

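    This needs a fair amount of memory to hold 100,000,000 eight-letter words; you can reduce it by remembering only the hashes of the words. A hash collision will occasionally make a new word look like a duplicate, so it gets skipped, but with about 26**5 * 10**3 = 11,881,376,000 combinations allowed by the pattern (somewhat fewer in practice, because random.sample never repeats a letter or digit within a word) you should be fine.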

    import random
    import string
    
    random.seed(0)
    NUM_WORDS = 100  # reduced for testing purposes
    found = 0
    seen_hashes = set()  # hashes of the words written so far
    with open("wordlist.txt", "w", encoding="utf-8") as ofile:
        while found < NUM_WORDS:
            # draw 5 distinct uppercase letters; the 5th is lowercased for the last position
            letters = random.sample(string.ascii_uppercase, k=5)
            digits = random.sample(string.digits, k=3)
            word = (letters[0] + digits[0] + letters[1] + digits[1]
                    + letters[2] + letters[3] + digits[2] + letters[4].lower())
            word_hash = hash(word)
            if word_hash in seen_hashes:
                continue  # duplicate (or, rarely, a hash collision): draw again
            print(word, file=ofile)
            seen_hashes.add(word_hash)
            found += 1
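
The question also asks about cleaning up a wordlist that has already been written. A minimal sketch of that second approach follows, assuming one word per line and enough RAM to hold every distinct word in a set. The function name `dedupe_wordlist` and the output file name used afterwards are made up for illustration; they are not part of the original code.

```python
def dedupe_wordlist(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each word.

    Hypothetical helper: returns a (kept, dropped) tuple of line counts.
    Assumes one word per line; blank lines are ignored.
    """
    seen = set()          # every distinct word read so far
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # streams the input file line by line
            word = line.strip()
            if not word:
                continue  # skip blank lines
            if word in seen:
                dropped += 1
                continue  # already written once, so drop this copy
            seen.add(word)
            dst.write(word + "\n")
            kept += 1
    return kept, dropped
```

Calling `dedupe_wordlist("wordlist.txt", "wordlist_unique.txt")` would write the unique words to a new file in their original order and leave the input untouched. The input is streamed, but the set still grows with the number of distinct words, so a 100,000,000-line file will likely need several gigabytes of memory.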