I am trying to create a Pandas DataFrame that is attached to an array stored in shared memory. I found a useful example in a response to another SO question, but I have not been able to make it work in some cases.
In the following, a numpy.ndarray is stored in shared memory and is attached to a Pandas DataFrame. This example works as expected - i.e., when the DataFrame is changed, the shared memory array is also updated. However, this only works if the array has a single column.
import numpy as np
import pandas as pd
from multiprocessing import shared_memory
dtype = np.dtype('float64')
shm_block = shared_memory.SharedMemory(create=True, size=10*dtype.itemsize)
shm_array = np.ndarray((10,), dtype=dtype, buffer=shm_block.buf)
# Initialize the array with some values
shm_array[:] = np.arange(10)
# Convert to a Pandas DataFrame
df = pd.DataFrame(shm_array, columns=['col1'])
df['col1'] *= 2  # update the column in place
# Verify the changes in the shared memory array
print(shm_array)
# Clean up shared memory
shm_block.close()
shm_block.unlink()
The above prints the updated array as expected.
[0. 2. 4. 6. 8. 10. 12. 14. 16. 18.]
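A quick way to confirm that the DataFrame genuinely wraps the shared buffer is np.shares_memory (run before the cleanup calls):
# True - the column is a view over the shared memory block
print(np.shares_memory(df['col1'].to_numpy(), shm_array))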
The following snippet tries to extend the example to an array with two columns, with the same float64 dtype as before. However, this gives unexpected results. It appears that Pandas decides to make a copy of the array for the DataFrame - the shared memory array is no longer connected to the DataFrame.
import numpy as np
import pandas as pd
from multiprocessing import shared_memory
# Create a shared memory array
col_names = ['col1','col2']
col_types = ['float','float']
dtype = np.dtype({'names': col_names, 'formats': col_types})
shm_block = shared_memory.SharedMemory(create=True, size=10*dtype.itemsize)
shm_array = np.ndarray((10,), dtype=dtype, buffer=shm_block.buf)
shm_array[:] = np.arange(10)  # each scalar broadcasts into both fields
print(shm_array)
# Convert to a Pandas DataFrame
df = pd.DataFrame(shm_array, columns=['col1','col2'], copy=False)
df['col1'] *= 2
print(df)
shm_array['col2'] *= 3
# Verify the changes in the shared memory array
print(shm_array)
# Clean up shared memory
shm_block.close()
shm_block.unlink()
Original Shared Memory:
[(0., 0.) (1., 1.) (2., 2.) (3., 3.) (4., 4.) (5., 5.) (6., 6.) (7., 7.) (8., 8.) (9., 9.)]
DataFrame with col1 updated:
   col1  col2
0   0.0   0.0
1   2.0   1.0
2   4.0   2.0
3   6.0   3.0
4   8.0   4.0
5  10.0   5.0
6  12.0   6.0
7  14.0   7.0
8  16.0   8.0
9  18.0   9.0
Shared Memory array col2 updated:
[(0., 0.) (1., 3.) (2., 6.) (3., 9.) (4., 12.) (5., 15.) (6., 18.) (7., 21.) (8., 24.) (9., 27.)]
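The same check confirms the copy (again, run before the cleanup calls):
# False here - pandas has copied the data out of shared memory
print(np.shares_memory(df['col1'].to_numpy(), shm_array))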
In the same post, @moon548834 mentioned that pd.DataFrame is creating a copy. As the above snippet shows, even if I use the copy=False option when creating the DataFrame, Pandas still chooses to create a copy.
I am unable to find a reference in the documentation on how/when Pandas decides to create a copy of the array. Also, just adding another column (defined through dtype) seems to change the behavior.
I would really appreciate it if anyone can point out what is wrong with the second snippet above. What is the easiest way to work with large arrays with diverse column datatypes using Pandas, multiprocessing and shared memory?
Let's experiment with making a dataframe from a small array.
In [39]: x = np.arange(12).reshape(3,4)
And make a 4 column structured array from that as well:
In [40]: from numpy.lib import recfunctions as rf
In [41]: y = rf.unstructured_to_structured(x, 'i,i,i,i');y
Out[41]:
array([(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
Both arrays have the same base, the original 1d arange:
In [42]: x.base,y.base
Out[42]:
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
That is, y is the same as x but with a different dtype. The x.__array_interface__ (and y's) can be instructive here as well.
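A quick check (not from the original session, but consistent with what the bases show):
np.shares_memory(x, y)           # True - same underlying buffer
x.__array_interface__['data']    # same (address, flag) tuple as y's here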
Now make dataframes from these (my numpy and pandas versions are a year or two out of date):
In [43]: df = pd.DataFrame(x); df
Out[43]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [44]: df1 = pd.DataFrame(y); df1
Out[44]:
f0 f1 f2 f3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Apart from the column names these look the same.
Now look at the values (.to_numpy() is, I believe, the preferred method):
In [45]: df.values
Out[45]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [46]: df.values.base
Out[46]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
For the x-based frame, the values base is the same, the arange. And __array_interface__['data'] confirms this.
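np.shares_memory says the same thing more directly:
np.shares_memory(df.values, x)   # True - one block wrapping x's buffer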
In [47]: df1.values
Out[47]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]], dtype=int32)
In [48]: df1.values.base
Out[48]:
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]], dtype=int32)
But the values from df1 are different. They do not share the x base.
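The corresponding check for the structured case:
np.shares_memory(df1.values, y)  # False - df1 holds its own copy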
In the x/df case, the underlying data is the x 2d array, one 'block'. In the y/df1 case, I think pandas has created 4 columns/Series from the fields of y.
There is, or at least was, some dataframe descriptor that shows the memory model or manager; info doesn't do that. The docs talk about a block versus an array memory manager.
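The private _mgr attribute is one such descriptor (internal, so its repr varies across pandas versions):
print(df._mgr)    # a single int block spanning all 4 columns, backed by x
print(df1._mgr)   # int32 block(s) backed by a fresh array, not by y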
As for shared memory, there's a chance that the x case will work, but there's no way that the y (structured array) case can be shared.
Consistent with that memory model, changing a column of df (in place) changes x (and even y):
In [53]: df[1]*=2; df
Out[53]:
0 1 2 3
0 0 2 2 3
1 4 10 6 7
2 8 18 10 11
In [54]: df1
Out[54]:
f0 f1 f2 f3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [55]: x
Out[55]:
array([[ 0, 2, 2, 3],
[ 4, 10, 6, 7],
[ 8, 18, 10, 11]])
In [56]: y
Out[56]:
array([(0, 2, 2, 3), (4, 10, 6, 7), (8, 18, 10, 11)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
But trying to change a column of df1 does not affect x or y:
In [60]: df1['f1'] *= 3
In [61]: df1
Out[61]:
f0 f1 f2 f3
0 0 3 2 3
1 4 15 6 7
2 8 27 10 11
In [62]: y
Out[62]:
array([(0, 2, 2, 3), (4, 10, 6, 7), (8, 18, 10, 11)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
The copy parameter of DataFrame says:
Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2d ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False will ensure that these inputs are not copied.
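The last clause is easy to sketch (relying on the quoted behaviour; whether the buffer is truly shared may depend on the pandas version):
s = pd.Series(np.arange(3.0))
df2 = pd.DataFrame({'a': s}, copy=False)
np.shares_memory(df2['a'].to_numpy(), s.to_numpy())   # True - not copied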
I suspect that pd.DataFrame(y) is the same as pd.DataFrame.from_records(y). from_records calls to_arrays. I can't find that function, but it sounds like it splits the structured array into multiple arrays (one per field?).
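That is easy to test:
pd.DataFrame.from_records(y)       # same f0..f3 frame as pd.DataFrame(y)
np.shares_memory(pd.DataFrame.from_records(y).values, y)   # False - copied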
A structured array is normally used when you want different dtypes for the fields. Using one just for the column/field names is suboptimal. The pandas equivalent assigns a different dtype to each column/Series, so the field-to-column copy makes perfect sense.
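So if the columns can live with one common dtype, the practical route for the shared-memory question is a homogeneous 2d array, as in the x case above. A minimal sketch (whether the in-place column update keeps writing through depends on the pandas version and its Copy-on-Write setting, so verify with np.shares_memory on your own setup):
import numpy as np
import pandas as pd
from multiprocessing import shared_memory

dtype = np.dtype('float64')
shape = (10, 2)                     # one homogeneous dtype for all columns
shm_block = shared_memory.SharedMemory(create=True,
                                       size=shape[0] * shape[1] * dtype.itemsize)
shm_array = np.ndarray(shape, dtype=dtype, buffer=shm_block.buf)
shm_array[:] = np.arange(20).reshape(shape)

# One 2d block, so pandas can wrap the buffer instead of splitting fields
df = pd.DataFrame(shm_array, columns=['col1', 'col2'], copy=False)
print(np.shares_memory(df.values, shm_array))   # True if no copy was made

df['col1'] *= 2                     # writes through to shm_array here
print(shm_array)

shm_block.close()
shm_block.unlink()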