python for-loop multiprocessing

Lines missing from output file when using a for loop with multiprocessing


Sorry if I'm wording this wrong. Below is my script. I'm trying to figure out why the archive file I created contains only 9874 lines when the file I open and read has 10000. I'd like to understand why some iterations are missing. I've run it a few times and the number varies every time. What am I doing wrong?

import multiprocessing
import hashlib
from tqdm import tqdm

archive = open('color_archive.txt', 'w')

def generate_hash(yellow: str) -> str:
    b256 = hashlib.sha256(yellow.encode()).hexdigest()
    x = ' '.join([yellow, b256])
    archive.write(f"{x}\n")

if __name__ == "__main__":
    listofcolors = []   
    with open('x.txt') as f:
        for yellow in tqdm(f, desc="Generating..."):
            listofcolors.append(yellow.strip())
           
    cpustotal = multiprocessing.cpu_count() - 1
    pool = multiprocessing.Pool(cpustotal)
    results = pool.imap(generate_hash, listofcolors)
    pool.close()
    pool.join()
print('DONE')

The script executes fine. However, when I look at the archive file, some lines are missing. For example, an input file with 10000 lines wrote only 9985 lines to the new file. What am I doing wrong?


Solution

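  • Here's another way to think about the problem. In the original script every worker process writes through its own copy of the archive file object, and whatever is still sitting in a worker's write buffer when that process exits is most likely never flushed, which would explain why the line count changes from run to run. Instead, each process can do its work and return the value to the main process, which is the only one that writes to the file. This is like using a queue without explicitly creating one.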

    import multiprocessing
    import hashlib
    
    def generate_hash(yellow: str) -> str:
        # Runs in a worker process: build the finished line and return it.
        b256 = hashlib.sha256(yellow.encode()).hexdigest()
        return yellow + " " + b256 + "\n"
    
    def main():
        with open('x.txt') as f:
            listofcolors = [s.strip() for s in f]
    
        # Leave one core free, but always use at least one worker.
        cpustotal = max(1, multiprocessing.cpu_count() - 1)
        pool = multiprocessing.Pool(cpustotal)
        # Only the main process writes; the with block flushes and closes the file.
        with open('color_archive.txt', 'w') as archive:
            for s in pool.imap(generate_hash, listofcolors):
                archive.write(s)
    
        pool.close()
        pool.join()
    
    if __name__ == "__main__":
        main()
        print('DONE')
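
If the input file is large, the same idea can be written so that lines are streamed instead of collected into a list first. The following is only a sketch under a few assumptions: the names hash_line and write_hashes are made up for illustration, the chunksize of 256 is an arbitrary starting point rather than a tuned value, and the demo generates its own temporary input instead of reading x.txt. It relies on the chunksize argument of pool.imap, which batches several items into one task to cut down on inter-process traffic, and it counts the lines written so the result can be checked against the input.

```python
import hashlib
import multiprocessing
import os
import tempfile


def hash_line(color: str) -> str:
    """Return 'color digest' plus a newline (hypothetical helper, runs in a worker)."""
    digest = hashlib.sha256(color.encode()).hexdigest()
    return f"{color} {digest}\n"


def write_hashes(src_path: str, dst_path: str, workers: int, chunksize: int = 256) -> int:
    """Hash every line of src_path in a pool; only this process writes dst_path.

    Returns the number of lines written. chunksize=256 is an assumed default.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst, \
            multiprocessing.Pool(workers) as pool:
        # Lazily strip lines so the whole file is never held in a list.
        stripped = (line.strip() for line in src)
        # imap yields results in input order; chunksize batches items per task.
        for out_line in pool.imap(hash_line, stripped, chunksize=chunksize):
            dst.write(out_line)
            written += 1
    return written


if __name__ == "__main__":
    # Demo with generated data in a temporary directory (not the asker's files).
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "x.txt")
        dst = os.path.join(tmp, "color_archive.txt")
        with open(src, "w") as f:
            f.writelines(f"color{i}\n" for i in range(10000))
        count = write_hashes(src, dst, workers=2)
        with open(dst) as f:
            # Every input line should produce exactly one output line.
            assert sum(1 for _ in f) == count == 10000
    print("DONE")
```

Because the output file is opened, written, and closed by a single process, nothing depends on worker processes flushing their buffers. Whether a larger or smaller chunksize helps depends on how expensive each item is, so it is worth timing on your own data.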