I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.
My code looks like this:
with open('f.txt', 'r') as f:
    function(f, w)
where function takes in the large text file and an empty output file, applies the per-line transformation, and writes the results to the empty file.
I have tried:
import multiprocessing
from multiprocessing import Pool

def multiprocess(f, w):
    cores = multiprocessing.cpu_count()
    with Pool(cores) as p:
        pieces = p.map(function, f, w)
    f.close()
    w.close()

multiprocess(f, w)
But when I do this, I get TypeError: '<=' not supported between instances of 'io.TextIOWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.
Even if you can successfully pass open file objects to child OS processes in your Pool as arguments f and w (which I don't think you can, on any OS), trying to read from and write to files concurrently is a bad idea, to say the least.
In general, I recommend using the Process class rather than Pool, assuming that the final output needs to maintain the same order as the input 20m-line file. Two ways of structuring that are sketched below.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
For maximum speed, at the cost of RAM:

- Read the whole file into a list with f.readlines(), if your entire dataset can fit in memory, comfortably.
- Split the list into one chunk per core, hand each chunk to its own Process, and del the original big list right after to free some RAM.
- Have each child write its results to a private part file, then merge the part files with subprocess.run('cat out_file* > big_output.txt') if you are running a UNIX system, or the equivalent Windows command for Windows.
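A minimal sketch of that approach, assuming the per-line work can be factored into a standalone process_line() function (the function name, the part-file naming scheme, and the toy .upper() transformation are all placeholders for your real code):

import multiprocessing
import subprocess
from multiprocessing import Process

def process_line(line):
    # hypothetical stand-in for your real per-line function
    return line.upper()

def worker(lines, index):
    # each child owns a private part file, so no file handle is ever shared
    with open(f'out_file_{index:04d}', 'w') as out:
        for line in lines:
            out.write(process_line(line))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()
    with open('f.txt', 'r') as f:
        lines = f.readlines()          # only if the file fits in RAM comfortably
    step = len(lines) // cores + 1
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    del lines                          # drop the big outer list; the chunks keep the data
    procs = [Process(target=worker, args=(chunk, i)) for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # zero-padded part names make the shell glob expand in chunk order (UNIX only)
    subprocess.run('cat out_file* > big_output.txt', shell=True)

Note that subprocess.run needs shell=True here for the glob and the redirection to work, and the zero-padded part names are what keep cat's expansion in the original chunk order.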
For intermediate speed and RAM usage, but with more complexity, use the Queue class:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue

- Figure out the number of cores (say 8).
- Have the parent read 10_000 * 8 lines at a time from your input 20m-row file and feed the workers 10_000 lines each through the queue, so only a few batches are in RAM at any moment.
- As before, have each worker write to its own part file, and merge the parts at the end with subprocess.run('cat out_file* > big_output.txt') if you are running a UNIX system, or the equivalent Windows command for Windows.
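A sketch of the Queue variant under the same assumptions (hypothetical process_line()); as a small deviation from the list above, it uses one shared bounded queue and one part file per batch rather than per worker, which caps RAM the same way and lets the final cat trivially restore the input order:

import multiprocessing
import subprocess
from itertools import islice
from multiprocessing import Process, Queue

def process_line(line):
    # hypothetical stand-in for your real per-line function
    return line.upper()

def worker(q):
    while True:
        item = q.get()
        if item is None:                      # sentinel: no more work
            return
        batch_idx, batch = item
        # one part file per batch; zero-padded names keep the glob ordered
        with open(f'out_file_{batch_idx:08d}', 'w') as out:
            for line in batch:
                out.write(process_line(line))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()
    q = Queue(maxsize=cores * 2)              # bounded queue caps RAM usage
    procs = [Process(target=worker, args=(q,)) for _ in range(cores)]
    for p in procs:
        p.start()
    with open('f.txt', 'r') as f:
        batch_idx = 0
        while True:
            batch = list(islice(f, 10_000))   # 10_000 lines per dispatch
            if not batch:
                break
            q.put((batch_idx, batch))         # blocks while the queue is full
            batch_idx += 1
    for _ in procs:
        q.put(None)                           # one stop signal per worker
    for p in procs:
        p.join()
    # UNIX only; use the equivalent command on Windows
    subprocess.run('cat out_file* > big_output.txt', shell=True)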
Convoluted? Well, it is usually a trade-off between speed, RAM, and complexity. Also, for a 20m-row task, one needs to make sure the data processing is as optimal as possible: inline as many functions as you can, avoid a lot of math, use pandas / numpy inside the child processes if possible, etc.