I want to share data between several Python scripts running in different environments. My data comes as pandas DataFrames (and Dask DataFrames). Typically, the DataFrames contain floats, ints, strings, and pd.Timestamps. They can be large (about 15,000,000 rows x 20 columns).
Unfortunately, the dependencies of my script environments are incompatible. Not only are the pandas versions different (2.0 vs. 1.2), but so are the Python versions (3.10 vs. 3.8).
Here is a brief summary of my scripts:
A simple dataflow diagram:

C <-┬-- B <---> A
    └---------- A
Due to the size of the Dataframes, I want to avoid writing anything to disk to share my data.
I've read about streaming pickled data, but since A and B need to communicate back and forth, this won't work: pickle is only backward compatible (a newer environment can read pickles from an older one, but not the other way around).
Is there any way to avoid converting the DataFrames before and after sharing (e.g. df -> dict -> share -> dict -> df) to reduce overhead?
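To make the pickle-compatibility point concrete, here is a minimal round-trip sketch. The pickle protocol itself is pinnable (protocol 4 is readable by every Python from 3.4 on), so between Python 3.8 and 3.10 the protocol is not the blocker; the real hazard is the pandas version, since a DataFrame pickled under pandas 2.0 may fail to unpickle under pandas 1.2 because internal class layouts change between releases:

```python
import pickle

import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, 2.0],
        "t": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    }
)

# Pin an old protocol so any supported Python can read the bytes.
# This does NOT protect against pandas-version mismatches: the pickle
# stores references to pandas-internal classes, which differ across
# pandas releases.
payload = pickle.dumps(df, protocol=4)
roundtrip = pickle.loads(payload)
print(roundtrip.equals(df))  # True (same process, same pandas version)
```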
From the Dask point of view, (distributed) clusters are multi-tenant, so you can totally connect from one client and create a dataset for another client to use (https://distributed.dask.org/en/stable/publish.html).
However, you normally want (and effectively need) matched versions. If the client's dask/distributed version does not match the scheduler's and workers', things may or may not work at all (expect warnings at the very least). If you use the dask.dataframe API in the client, this is a problem, because pickling is how data is copied between client and workers.
However, you can run plain functions on workers, invoked from the client, even to create dataframes and "publish" them.
import dask.distributed

def make_it():
    # Runs on a worker, so it uses the workers' pandas/dask versions.
    from distributed import get_client
    import dask.dataframe as dd

    client = get_client()  # the worker's own handle to the scheduler
    df = ...  # build the dataframe here
    df2 = client.persist(df)
    client.publish_dataset(negative_accounts=df2)

client = dask.distributed.Client(<cluster address>)
client.submit(make_it).result()  # wait until the dataset is published
With this approach, the first client uses the pandas version installed on the workers and pickles nothing except the function code, so this is very likely to work. If the second client session's versions match the workers', it can call client.get_dataset without problems.