I'm trying to thread my code for better performance, using the multiprocessing library's Process module.
The skeleton of code is to create dictionaries for each thread that they work on, and after it's all done, the dictionaries are summed and saved to a file. The resources are created like:
histos = {}
for int i in range(number_of_threads):
histos[i] = {}
histos[i]['all'] = ROOT.TH1F objects
histos[i]['kinds_of'] = ROOT.TH1F objects
histos[i]['keys'] = ROOT.TH1F objects
Then in the Processes, each thread works with its own histos[thread_number] object, working on the contained ROOT.TH1Fs. However, my problem is that apparently if I start the threads with Process like this:
proc = {}
for i in range(Nthreads):
it0 = 0 + i * n_entries / Nthreads # just dividing up the workload
it1 = 0 + (i+1) * n_entries / Nthreads
proc[i] = Process(target=RecoAndRecoFix, args=(i, it0, it1, ch,histos))
# args: i is the thread id (index), it0 and it1 are indices for the workload,
# ch is a variable that is read-only, and histos is what we defined before,
# and the contained TH1Fs are what the threads put their output into.
# The RecoAndFix function works inside with histos[i], thus only accessing
# the ROOT.TH1F objects that are unique to it. Each thread works with its own histos[i] object.
proc[i].start()
then the threads do have access their histos[i] objects, but cannot write to them. To be precise, when I call Fill() on the TH1F histograms, no data is filled because it cannot write to the objects because they are not shared variables.
So here: https://docs.python.org/3/library/multiprocessing.html I've found that I should instead use multiprocessing.Array() to create an array that can be both read and written to by the threads, like this:
typecoder = {}
histos = Array(typecoder,number_of_threads)
for int i in range(number_of_threads):
histos[i] = {}
histos[i]['all'] = ROOT.TH1F objects
histos[i]['kinds_of'] = ROOT.TH1F objects
histos[i]['keys'] = ROOT.TH1F objects
However, it won't accept dictionary as a type. It will not work, it says TypeError: unhashable type: 'dict'
So what would be the best approach to solve this issue? What I need is to pass an instance of every "all kinds of keys" stored in dictionaries to each thread, so they work on their own. And they must be able to write these received resources.
Thanks for your help, and sorry if I'm overlooking something trivial, I did threaded code before, but not yet with python.
The missing piece is the distinction is between "process" and "thread"; you mix them in your post, but your approach will only work with threads, not with processes.
Threads all share memory; all of them will refer to the same dictionary, and can therefore use it to communicate with each other and with the parent.
Processes have separate memory; each will get its own copy of the dictionary. If they want to communicate, they have to communicate by other means (for example, using multiprocessing.Queue
). On the other hand, this means they get the safety of separation.
An additional complication in Python is "the GIL"; threads will mostly share the same Python interpreter serially, only running in parallel when doing I/O, accessing the network or with a few libraries that make special provision for it (numpy, image processing, a couple of others). Meanwhile, processes get full parallelism.