I am trying to parse many files found in a directory; however, using multiprocessing slows my program down.
```python
# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles')  # <--- 1000 .txt files found here, combined ~100MB
```
Following this example from the Python documentation:
```python
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
```
I've written this piece of code:
```python
from multiprocessing import Pool
from api.ttypes import *
import gc
import os

def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)

def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_parse, myList)
```
I followed the example: I put the names of all the files ending with .txt in a list, created a Pool, and mapped it to my function. I then want to return a list of objects, each holding the parsed data of one file. However, it surprises me that I got the following results:
```python
#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2  ---> ~130(s)
```
My machine: 62.8 GiB RAM, Intel® Core™ i7-6850K CPU @ 3.60GHz × 12.
What am I missing here?
Thanks in advance!
Looks like you're I/O bound:
In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
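A quick way to check this on your own data is to time the disk reads separately from the parsing. The sketch below is illustrative (the function name is made up, and plain tuples stand in for your `CoresetPoint` objects so it runs standalone):

```python
import os
import time

def profile_directory(directory):
    # Time raw disk reads separately from pure parsing; if read_time
    # dominates, the job is I/O bound and more workers won't help.
    read_time = parse_time = 0.0
    for name in os.listdir(directory):
        if not name.endswith(".txt"):
            continue
        start = time.perf_counter()
        with open(os.path.join(directory, name)) as f:
            lines = f.readlines()
        read_time += time.perf_counter() - start

        start = time.perf_counter()
        points = [tuple(int(v) for v in line.split()) for line in lines]
        parse_time += time.perf_counter() - start
    return read_time, parse_time
```

If `read_time` is the larger of the two, adding processes only adds contention for the same disk.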
You probably need to have your main thread do the reading and add the data to the pool when a subprocess becomes available. This will be different to simply using map over the filenames.
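One way to arrange that is to do all the reading in the main process and hand only the in-memory text to the workers. This is a sketch under that assumption (the names `parse_text`/`parse_directory` are hypothetical, and `(x, y)` tuples stand in for `CoresetPoint` so it is self-contained):

```python
from multiprocessing import Pool

def parse_text(text):
    # Workers receive the already-read text, not a path, so only the
    # CPU-bound parsing runs in the pool; tuples stand in for CoresetPoint.
    return [tuple(int(v) for v in line.split()) for line in text.splitlines()]

def parse_directory(paths):
    # The main process does all the disk I/O sequentially...
    texts = []
    for path in paths:
        with open(path) as f:
            texts.append(f.read())
    # ...and the pool only does the parsing.
    with Pool(2) as pool:
        return pool.map(parse_text, texts)
```

With ~100MB of input this keeps everything in memory at once, which your 62.8 GiB of RAM can easily absorb.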
As you are processing a line at a time and the inputs are already split across files, you can use fileinput to iterate over the lines of multiple files, and map a function that processes lines instead of files:
Passing one line at a time might be too slow, so we can pass chunks of lines instead, adjusting the chunk size until we find a sweet spot. Our function parses chunks of lines:
```python
def _parse_coreset_points(lines):
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)
```
And our main function:
```python
import fileinput
import os
from itertools import islice
from multiprocessing import Pool

def _chunks(lines, size):
    # group the line stream into lists of `size` lines, so each task
    # handed to the pool is a real chunk of lines
    it = iter(lines)
    return iter(lambda: list(islice(it, size)), [])

def getParsedFiles(directory):
    pool = Pool(2)
    txts = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith(".txt")]
    return pool.imap(_parse_coreset_points, _chunks(fileinput.input(txts), 100))
```
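Finding the chunk-size sweet spot mentioned above is easiest to do empirically. A minimal timing harness along these lines can help (squaring stands in for the real parsing work; the name `time_chunksize` is made up):

```python
import time
from multiprocessing import Pool

def square(n):
    return n * n

def time_chunksize(chunksize, items=10000):
    # Drain an imap with the given chunksize and measure how long it takes;
    # larger chunks mean fewer dispatch round-trips between main and workers.
    with Pool(2) as pool:
        start = time.perf_counter()
        results = list(pool.imap(square, range(items), chunksize=chunksize))
        return time.perf_counter() - start, results
```

Running it with a few chunk sizes (say 1, 10, 100, 1000) on your real data shows where the dispatch overhead stops dominating.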