Tags: python, dask, numpy-ndarray, dask-distributed, msgpack

msgpack could not serialize large numpy ndarrays


I am trying to send a large numpy ndarray through client.scatter(np_ndarray). The array is about 10 GB, and I am getting this error: msgpack Could not serialize object of type ndarray.

As a workaround, I used pickle when creating my client, like this: Client(self.adr, serializers=['dask', 'pickle']).
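For context, the workaround succeeds because numpy ndarrays can be pickled natively. A minimal round-trip sketch, using a small stand-in array rather than the 10 GB one:

```python
import pickle

import numpy as np

# Small stand-in for the ~10 GB array from the question.
arr = np.arange(16, dtype=np.int64).reshape(4, 4)

# pickle handles ndarrays out of the box, which is why adding
# 'pickle' to the client's serializers avoids the msgpack error.
payload = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

assert np.array_equal(arr, restored)
```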

Thank you!


Solution

  • To answer your other three questions:

    Is msgpack always used when data is sent by scatter, or does Dask decide the protocol depending on the data type?

    Dask decides: it selects a default serializer depending on your data (ref: Dask Docs - Serialization).

    I noticed that there is a Msgpack-Numpy project. Are you planning to add support for it in Dask, say if I open an issue describing the use case?

    I checked with a Dask contributor, and it looks like there is no plan to support it, now or in the near future. That said, please feel free to start a discussion to gather more thoughts. :)

    When I initialize my client this way, what are the main advantages and disadvantages?

    Serialization in Dask is tricky, so it's hard to define clear (dis)advantages. Generally speaking, though, manually specifying serializers is not recommended.