Tags: python, casting, multiprocessing, shared-memory, void-pointers

Accessing any object type from multiprocessing shared_memory?


Suppose I create a shared memory object:

from multiprocessing import shared_memory
shm_a = shared_memory.SharedMemory(create=True, size=1024)
buffer = shm_a.buf

and put a generic object of a generic class, such as:

class GenericClass:
    def __init__(self, a, b):
        self.a = a
        self.b = b

in it:

gen_obj_a = GenericClass(1,6)
buffer = gen_obj_a

Now, in another terminal, I have:

from multiprocessing import shared_memory
existing_shm = shared_memory.SharedMemory(name='psm_21467_46075')

How do I assign a variable, say gen_obj_b, to the GenericClass object in shared memory?

I want to be able to do this where GenericClass is much more complex than the example above and doesn't have a serialization function.

In C++, one would do this by casting a void * to the GenericClass object type, but how is this done in Python with shared memory?

cf. multiprocessing.shared_memory Python documentation


Solution

    Why not

    You can't do that directly in Python, because objects are managed by the runtime.

    Your GenericClass instance, while it is represented by bytes in a given data structure in memory, can't simply have those same bytes accessed as an instance in another Python process, even if that process uses the same Python modules as the caller: an instance has internal pointers to the class object itself (for one example), and the class object will be at a different address in the other interpreter. (It might be at the same one, in rare cases, but that would be a matter of luck.) All other references in an object transposed in this way would fail as well.

    Besides references, there is also the reference-count problem: when the object becomes "official" in the target process, its reference count would increase, breaking the object-accounting machinery on the caller's side in subtle ways.

    As a side-side note on "In C++, one would do this by casting a void * to the GenericClass": keep in mind that in C++ and other static languages, the information that those bytes are an instance of GenericClass is hardcoded in the source code of the program. The runtime doesn't know that: it only knows that to pick the integer in field a it must read the 4 bytes at offset 0 from the object's buffer address. In Python, by contrast, the class an object belongs to is stored in the instance itself, under the hood, as a "pointer" (which we call "a reference" in Python parlance), and the information needed to resolve the memory layout of the class's fields is retrieved dynamically.
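
    As a plain-Python illustration of that difference (no shared memory involved), the class and the fields are looked up at runtime from the instance itself, not from anything baked into the code:

    class GenericClass:
        def __init__(self, a, b):
            self.a = a
            self.b = b

    obj = GenericClass(1, 6)
    # the instance carries a reference to its own class:
    print(obj.__class__ is GenericClass)   # True
    # and the fields are found by a dynamic lookup, not by a fixed byte offset:
    print(obj.__dict__)                    # {'a': 1, 'b': 6}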

    A similar approach that gets a bit further, but no, don't:

    As a side note, note that from Python 3.12 onwards, subinterpreters are reachable through Python code, and if you do this from other sub-interpreters in the same process, instead of from other processes, it may work. I actually have some code that does that in https://github.com/jsbueno/extrainterpreters - (not updated for Python 3.14 yet).

    Yet even in the "extrainterpreters" code, I don't make this the main way of accessing a given instance from another interpreter: the reference counting would still be a problem, as would objects referenced through attributes (lists, dictionaries, other instances): you could get parallel access to these data structures without the protection of the GIL (nor of the finer-grained locks used in Python's free-threaded builds).

    What works:

    You serialize your object in process 1, copy the serialized bytes to the shared memory, and de-serialize it from there. Python's default serialization machinery, pickle, is near magical.

    Of course, there is a lot of overhead compared with simply using the object in the other process (~2-3 orders of magnitude), but this is the "canon":

    First, your class has to exist in a module that is importable from both processes: generic.py

    class GenericClass:
        def __init__(self, a, b):
            self.a = a
            self.b = b
    

    Then, this in one terminal:

    from multiprocessing import shared_memory
    from generic import GenericClass
    import pickle
    
    shm = shared_memory.SharedMemory(create=True, size=1024)
    # take note of the printed name: it is needed to attach from the other terminal
    print(shm.name)
    
    obj1 = GenericClass(5, 6)
    serialized = pickle.dumps(obj1)
    shm.buf[0: len(serialized)] = serialized
    

    And this will work on the second terminal:

    from multiprocessing import shared_memory
    import pickle
    
    shm_existing = shared_memory.SharedMemory("psm_ff9c5e26")  # the name printed by the first terminal
    
    obj2 = pickle.loads(shm_existing.buf)
    
    # note you don't have to explicitly import `generic.py` here:
    # pickle will do that for you.
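
    One detail worth noting: pickle.loads can take the whole 1024-byte buffer because pickle stops reading at its own end-of-stream marker and ignores the trailing zero bytes. If you prefer to be explicit about the payload size, here is a small variation, a sketch only, that stores the payload length in the first 8 bytes with struct:

    import pickle
    import struct
    from multiprocessing import shared_memory
    
    from generic import GenericClass
    
    # writer side:
    shm = shared_memory.SharedMemory(create=True, size=1024)
    payload = pickle.dumps(GenericClass(5, 6))
    # store the payload length in the first 8 bytes, then the payload itself:
    shm.buf[0:8] = struct.pack("<Q", len(payload))
    shm.buf[8:8 + len(payload)] = payload
    
    # reader side (after attaching by name in the other process):
    length, = struct.unpack("<Q", shm.buf[0:8])
    obj = pickle.loads(shm.buf[8:8 + length])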
    

    Complementing the side-side note from section 1 above: pickle can retrieve information about the class even without an explicit import, because the serialized data includes the __module__ class attribute reachable from the original instance, and that is a string telling how the module is imported. pickle internally issues the equivalent of an import of the module named in the instance's __module__ attribute before instantiating the object in the target process.
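
    This can be seen directly, assuming the generic.py module from above:

    from generic import GenericClass
    import pickle
    
    obj = GenericClass(5, 6)
    # the module name travels with the serialized data:
    print(GenericClass.__module__)            # "generic"
    print(b"generic" in pickle.dumps(obj))    # True: the name is right in the bytes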

    Aside from that, serialization is used by Python's multiprocessing stdlib module, as well as by concurrent.futures and the data-sharing primitives in there, like multiprocessing.Queue. So if you can use those to manage your other processes, you can make use of these constructs for a more straightforward experience than having to manually pickle and unpickle things (the Queue class does that for you).
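
    For instance, a minimal sketch of passing a GenericClass instance through a multiprocessing.Queue, which pickles on put and unpickles on get behind the scenes (assuming the same generic.py module as above):

    from multiprocessing import Process, Queue
    
    from generic import GenericClass
    
    def worker(q):
        obj = q.get()               # unpickled here automatically
        print(obj.a, obj.b)
    
    if __name__ == "__main__":
        q = Queue()
        q.put(GenericClass(5, 6))   # pickled automatically on put
        p = Process(target=worker, args=(q,))
        p.start()
        p.join()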

    Real world scenario

    If you have Python objects that have some attributes plus a large buffer with numeric data (like a dataframe from pandas or polars), it may be possible to serialize just the "boilerplate" side of the dataframe and share the actual data buffer without copying, thus gaining speed.

    But trying to implement that on your own may be complex; check PEP 574 for a starting point. This may be what you need in the end - but also, projects like Dask already implement that kind of thing, and can be much faster than plain pickle serialization.
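
    As an illustration of that mechanism, here is a minimal sketch of pickle protocol 5 with out-of-band buffers (PEP 574); the NumPy array here is just a stand-in for any large buffer-backed object:

    import pickle
    
    import numpy as np  # stand-in for any object exposing large data buffers
    
    data = np.zeros(1_000_000)
    buffers = []
    # with protocol 5, large buffers are handed to the callback instead of
    # being copied inline into the pickle stream:
    payload = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)
    # `payload` is now just small metadata; `buffers` holds zero-copy
    # pickle.PickleBuffer views of the array's memory:
    restored = pickle.loads(payload, buffers=buffers)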

    Sharing "live" objects directly:

    If you want to be able to track object attributes across processes, so that obj.a = 5 in process 2 is visible in process 1 in a safe way, there are the tools and classes under multiprocessing.managers. These are available by default in Python, and involve a somewhat complex architecture which may include a 3rd process to manage communication across processes. I've personally not used those in a major project, and don't know of people who have, but there are certainly some use scenarios for it.
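
    As a minimal sketch of that idea, using the stock Manager().Namespace() proxy (rather than a custom class) to keep the example self-contained:

    from multiprocessing import Manager, Process
    
    def worker(ns):
        ns.a = 5   # the attribute is updated in the manager process
    
    if __name__ == "__main__":
        with Manager() as manager:
            ns = manager.Namespace()
            ns.a = 1
            p = Process(target=worker, args=(ns,))
            p.start()
            p.join()
            print(ns.a)   # 5: the change made in the other process is visible here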