python, multiprocessing, chunking

Read in large text file (~20m rows), apply function to rows, write to new text file


I have a very large text file, and a function that does what I need it to do to each line. However, reading the file line by line and applying the function takes roughly three hours. I'm wondering whether there's a way to speed this up with chunking or multiprocessing.

My code looks like this:

with open('f.txt', 'r') as f, open('out.txt', 'w') as w:
    function(f, w)

Where the function takes the large input file and an (initially empty) output file, processes the input line by line, and writes the results to the output file.
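In simplified form, the function itself looks something like this (the real per-line logic is stubbed out here):

def transform_line(line):
    # stand-in for the real per-line logic
    return line.upper()

def function(f, w):
    # read the input line by line, transform each line, write the result out
    for line in f:
        w.write(transform_line(line))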

I have tried:

import multiprocessing
from multiprocessing import Pool

def multiprocess(f, w):
    cores = multiprocessing.cpu_count()

    with Pool(cores) as p:
        pieces = p.map(function, f, w)

    f.close()
    w.close()

multiprocess(f, w)

But when I do this, I get TypeError: unsupported operand type(s) for <=: 'io.TextIOWrapper' and 'int'. This could also be the wrong approach, or I may be doing this entirely wrong. Any advice would be much appreciated.


Solution

  • Even if you could successfully pass open file objects to the child OS processes in your Pool as the arguments f and w (which I don't think you can on any OS), having multiple processes read from and write to the same files concurrently is a bad idea, to say the least.

    In general, I recommend using the Process class rather than Pool, assuming that the end result needs to keep the same order as the 20m lines of input.

    https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
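
    To make that concrete, here is a minimal sketch of the kind of Process-based split I mean (it is the "maximum speed, but most RAM" variant from the list further down). transform_line and the output file names are placeholders for your actual per-line logic and paths:

    import multiprocessing as mp
    import os

    def transform_line(line):
        # placeholder for your real per-line logic
        return line.upper()

    def worker(chunk, part_path):
        # each child transforms its own slice of lines and writes its own part file
        with open(part_path, 'w') as out:
            for line in chunk:
                out.write(transform_line(line))

    def main():
        cores = mp.cpu_count()

        # read the whole input into RAM (fast, but memory-hungry for 20m rows)
        with open('f.txt', 'r') as f:
            lines = f.readlines()

        # split into one chunk per core
        size = (len(lines) + cores - 1) // cores
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]

        # one Process per chunk, each with its own output part file
        parts = [f'out.part{i}' for i in range(len(chunks))]
        procs = [mp.Process(target=worker, args=(chunk, path))
                 for chunk, path in zip(chunks, parts)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

        # stitch the part files back together in chunk order,
        # so the output keeps the same order as the input
        with open('out.txt', 'w') as w:
            for part in parts:
                with open(part) as src:
                    w.write(src.read())
                os.remove(part)

    if __name__ == '__main__':
        main()

    Note that each child receives its whole chunk up front, which is where the RAM cost comes from.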

    Broadly, you can split the work in three ways, each a different trade-off:

    • the slowest solution, but the most efficient RAM usage;

    • maximum speed, but the most RAM consumption;

    • an intermediate trade-off between speed and RAM, but the most complex: using the Queue class (see the sketch just after the link below).

    https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
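
    A rough sketch of that Queue version (again, transform_line and the file names are placeholders): the parent reads the file in batches and feeds a bounded task queue, the workers transform batches and push them onto a result queue, and the parent writes finished batches back out in input order, buffering any that arrive early:

    import multiprocessing as mp
    import queue

    SENTINEL = None   # tells a worker there is no more work
    BATCH = 10_000    # lines per task; tune for your data

    def transform_line(line):
        # placeholder for your real per-line logic
        return line.upper()

    def worker(task_q, result_q):
        # pull (batch_index, lines) tasks, transform them, push the results back
        while True:
            task = task_q.get()
            if task is SENTINEL:
                break
            idx, lines = task
            result_q.put((idx, [transform_line(line) for line in lines]))

    def main():
        cores = mp.cpu_count()
        task_q = mp.Queue(maxsize=cores * 2)   # bounded, so the reader can't run far ahead
        result_q = mp.Queue()

        workers = [mp.Process(target=worker, args=(task_q, result_q))
                   for _ in range(cores)]
        for p in workers:
            p.start()

        pending = {}   # finished batches that arrived out of order
        next_idx = 0   # index of the next batch to write
        n_batches = 0

        with open('f.txt') as f, open('out.txt', 'w') as w:

            def drain(block):
                # pull whatever results are ready; write them out in input order
                nonlocal next_idx
                while True:
                    try:
                        idx, lines = result_q.get(block)
                    except queue.Empty:
                        return
                    block = False   # only the first get may block
                    pending[idx] = lines
                    while next_idx in pending:
                        w.writelines(pending.pop(next_idx))
                        next_idx += 1

            # feed the file to the workers in batches, draining results as we go
            batch = []
            for line in f:
                batch.append(line)
                if len(batch) == BATCH:
                    task_q.put((n_batches, batch))
                    n_batches += 1
                    batch = []
                    drain(block=False)
            if batch:
                task_q.put((n_batches, batch))
                n_batches += 1
            for _ in workers:
                task_q.put(SENTINEL)

            # wait for whatever is still being processed
            while next_idx < n_batches:
                drain(block=True)

        for p in workers:
            p.join()

    if __name__ == '__main__':
        main()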

    Convoluted? Well, it is usually a trade-off between speed, RAM, and complexity. Also, for a 20m-row task you need to make sure the data processing itself is as optimal as possible: inline as many functions as you can, avoid a lot of math, use Pandas / numpy inside the child processes if possible, etc.
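
    For example, if your per-line work can be expressed as vectorized string operations, a worker can hand its whole chunk to pandas instead of looping line by line in Python (the transformation shown is just a stand-in):

    import pandas as pd

    def worker_vectorized(chunk, part_path):
        # chunk is a list of raw lines; let pandas apply the transformation in bulk
        s = pd.Series(chunk)
        result = s.str.strip().str.upper() + '\n'   # stand-in for your real logic
        with open(part_path, 'w') as out:
            out.writelines(result)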