I'm using the pystata package, which lets me run Stata code from Python and send data between Python and Stata.
As I understand it, there is a single Stata instance running in the background. I want to bootstrap some code that wraps around the Stata code, and I would like to run this in parallel.
Essentially, I would like to have something like
from joblib import Parallel, delayed
import numpy as np
import pandas as pd

def single_instance(seed):
    # initialize Stata inside the worker
    from pystata import config, stata
    config.init('be')
    # run some Stata code (load a data set and collapse, for example)
    stata.run('some code')
    # load the Stata data into Python
    df = stata.pdataframe_from_data()
    out = do_something_with_data(df, seed)
    return out

if __name__ == '__main__':
    seeds = np.arange(1, 100)
    Parallel(backend='loky', n_jobs=-1)(
        delayed(single_instance)(seed) for seed in seeds)
where some code runs in parallel and each worker initializes its own Stata instance. However, I'm worried that all of these parallel workers are accessing the same Stata instance -- can this work as I expect? How should I set this up? When I run the code above, I get:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/x/miniconda3/envs/stata/lib/python3.12/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/externals/cloudpickle/cloudpickle.py", line 649, in subimport
__import__(name)
File "/usr/local/stata/utilities/pystata/stata.py", line 8, in <module>
config.check_initialized()
File "/usr/local/stata/utilities/pystata/config.py", line 281, in check_initialized
_RaiseSystemException('''
File "/usr/local/stata/utilities/pystata/config.py", line 86, in _RaiseSystemException
raise SystemError(msg)
SystemError:
Note: Stata environment has not been initialized yet.
To proceed, you must call init() function in the config module as follows:
from pystata import config
config.init()
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 299, in <module>
bootstrap(aggregation='occ')
File "test.py", line 277, in bootstrap
z = Parallel(backend='loky', n_jobs=-1)(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/parallel.py", line 1098, in __call__
self.retrieve()
File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/parallel.py", line 975, in retrieve
self._output.extend(job.get(timeout=self.timeout))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
return future.result(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/x/miniconda3/envs/stata/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/x/miniconda3/envs/stata/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Using backend="multiprocessing" as an argument to joblib.Parallel will launch the Stata instances in separate processes. The traceback above comes from the default loky backend: while unpickling the task in the worker, cloudpickle re-imports pystata.stata, and that import raises the SystemError because config.init() has not yet been called in that process.