I am using a `multiprocessing` pool to train machine learners.

Each `LearnerRun` object gets a learner, a dictionary of hyperparameters, a name, some more options in another options dictionary, the name of a directory to write results to, a set of IDs of examples to train on (a slice or numpy array), and a set of IDs of examples to test on (also a slice or numpy array). Importantly, the training and testing data are not read yet: the sets of IDs are relatively small and direct a later function's database-reading behavior.
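Roughly, the object looks like this (a sketch from the description above; the attribute names are my reconstruction, not the actual code):

```python
class LearnerRun:
    """Sketch of the run object described above (names are illustrative)."""

    def __init__(self, learner, hyperparams, name, options, results_dir,
                 train_ids, test_ids):
        self.learner = learner
        self.hyperparams = hyperparams
        self.name = name
        self.options = options
        self.results_dir = results_dir
        self.train_ids = train_ids    # slice or numpy array of example IDs
        self.test_ids = test_ids      # slice or numpy array of example IDs

    def run(self):
        print(f"starting {self.name}")  # this is the print that never appears
        # ...read the examples from the database by ID, train, test, write results...
```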
I call `self.pool.apply_async(learner_run.run)`, which formerly worked fine. Now the pool seems to be loaded up, but a print statement at the top of the `run()` function never prints, so the processes are not actually being run.
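For reference, the submission path is roughly this (a sketch; `learner_runs` stands in for however the `LearnerRun` objects are collected):

```python
from multiprocessing import Pool

if __name__ == "__main__":
    pool = Pool(processes=4)
    # apply_async must pickle the bound method lr.run to ship it to a
    # worker process, and with it the entire LearnerRun instance.
    handlers = [pool.apply_async(lr.run) for lr in learner_runs]
    pool.close()
    pool.join()
```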
I've tracked down some other threads about this and found that I can see the problem in more detail with `handler = self.pool.apply_async(learner_run.run)` followed by `handler.get()`. This prints `SystemError: NULL result without error in PyObject_Call`.
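That is, `get()` re-raises whatever exception the worker hit:

```python
handler = self.pool.apply_async(learner_run.run)
handler.get()  # re-raises the worker's exception:
               # SystemError: NULL result without error in PyObject_Call
```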
Great, something I can Google. But all I can find on this issue with `multiprocessing` is that it can be caused by passing arguments that are too big to pickle to the subprocess. But I am obviously passing no arguments to my subprocess. So what gives?
What else, aside from arguments exceeding the allotted memory size (which I am reasonably sure is not the problem here), can cause `apply_async` to give a null result?
Again, this worked before I left for vacation and hasn't been changed. What kinds of changes to other code might cause this to stop working?
If I do not `get()` from the handler, so execution doesn't stop on errors, the memory usage follows a strange pattern.
Okay, I found the problem. In fact, my `LearnerRun` was too large for `multiprocessing` to handle. But the way in which it was too large is pretty subtle, so I'll describe it.
Evidently it is not just the arguments that need to be pickled; the function is pickled too, including the `LearnerRun` object its execution relies on (the `self`).
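You can see this with a small, self-contained demonstration (the class here is made up, but the mechanism is the point): pickling a bound method drags the entire instance along with it.

```python
import pickle


class Holder:
    def __init__(self):
        # stands in for a large object hanging off self, e.g. a database
        self.payload = list(range(10**6))

    def work(self):
        return len(self.payload)


h = Holder()
blob = pickle.dumps(h.work)   # pickling just the bound method...
print(len(blob))              # ...serializes the payload too: several MB
print(pickle.loads(blob)())   # 1000000: the instance came along for the ride
```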
`LearnerRun`'s constructor takes all the things in the options dictionary passed to it and uses `setattr` to turn all the keys and values into member variables. This alone is fine, but my coworker realized that this left a couple of strings that would need to be database references, and set `self.trainDatabase = LarData(self.trainDatabase)` and `self.coverageDatabase = LarData(self.coverageDatabase)`, which ordinarily would be fine.
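Reconstructed, the constructor did roughly this (`LarData` is our database wrapper; the stub below just stands in for it):

```python
class LarData:
    """Stand-in for the real database wrapper."""
    def __init__(self, name):
        self.name = name
        # ...the real constructor loads the database, which is huge...


class LearnerRun:
    def __init__(self, options, **kwargs):
        # turn every key/value in the options dict into a member variable
        for key, value in options.items():
            setattr(self, key, value)
        # the subtle part: replace the two database-name strings with live
        # database objects, which now travel with every pickle of self
        self.trainDatabase = LarData(self.trainDatabase)
        self.coverageDatabase = LarData(self.coverageDatabase)
```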
Except this means that to pickle the class, you have to pickle the entirety of the databases! I discovered this during a sanity check in which I just serialized the `LearnerRun` itself to see what would happen, with `pickle.dumps(learner_run)`. My memory was flooded, and the swap began filling up alarmingly quickly until everything fell over.
So what about pickling to disk? `pickle.dump(learner_run, f)` (with `f` an open file; `pickle.dump` wants a file object, not a filename) also blew up. It got to 14.3 GiB before I terminated it!
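Written out, the sanity checks were:

```python
import pickle

blob = pickle.dumps(learner_run)        # in memory: flooded RAM, then swap

with open("learner_run.pkl", "wb") as f:
    pickle.dump(learner_run, f)         # on disk: hit 14.3 GiB before I killed it
```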
What about removing those references and calling the `LarData` constructor later, when it's needed? Bam. Fixed. Everything works; `multiprocessing` doesn't give a mysterious `SystemError` anymore.
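Sketched, the fix is just to defer construction until inside the worker (`LarData` as above):

```python
class LearnerRun:
    def __init__(self, options, **kwargs):
        for key, value in options.items():
            setattr(self, key, value)
        # trainDatabase / coverageDatabase stay plain strings here, so
        # pickling self stays cheap

    def run(self):
        # build the heavy database objects on the worker side, after the
        # pickle boundary has already been crossed
        train_db = LarData(self.trainDatabase)
        coverage_db = LarData(self.coverageDatabase)
        # ...train and test as before...
```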
This is the second time `pickle` has caused me major pain recently.