python pandas type-conversion pickle

Pandas astype becomes in-place operation for data loaded from pickle files


Pandas astype() appears to unexpectedly switch to performing in-place operations after loading data from a pickle file. Concretely, for astype(str), the types of the values stored in the input DataFrame are modified, even though the result is assigned to a separate variable. What is causing this behavior?

Pandas version: 2.0.3

Minimal example:

import pandas as pd
import numpy as np

# create a test dataframe
df = pd.DataFrame({'col1': ['hi']*10 + [False]*20 + [np.nan]*30})

# print the data types of the cells, before and after casting to string
print(pd.unique([type(elem) for elem in df['col1'].values]))
_ = df.astype(str)
print(pd.unique([type(elem) for elem in df['col1'].values]))

# store the dataframe as pkl and directly load it again
outpath = 'C:/Dokumente/my_test_df.pkl'
df.to_pickle(outpath)
df2 = pd.read_pickle(outpath)

# print the data types of the cells, before and after casting to string
print(pd.unique([type(elem) for elem in df2['col1'].values]))
_ = df2.astype(str)
print(pd.unique([type(elem) for elem in df2['col1'].values]))

Output:

[Screenshot of the printed output; image not available]


Solution

  • This is a bug that has been fixed in pandas 2.2.0:

    Bug in DataFrame.astype() when called with str on unpickled array - the array might change in-place (GH 54654)

    As noted by Itayazolay in the PR, regarding the pickle MRE used there:

    The problem is not exactly with pickle, it's just a quick way to reproduce the problem.
    The problem is that the code here attempts to check if two arrays have the same memory (or share memory) and it does so incorrectly - result is arr
    See numpy/numpy#24478 for more technical details.
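
The difference between the two checks can be illustrated with a small sketch. It uses a plain NumPy view as a stand-in for the problematic case (a new Python object that shares its data with the input); this is an assumption made for illustration, not the exact code path from the PR. The helper name copy_if_aliased is hypothetical.

```python
import numpy as np

# Object-dtype array resembling the column from the question.
arr = np.array(["hi", False, np.nan], dtype=object)

# Stand-in for the problematic case: a view is a different Python object
# that still points at the same underlying buffer.
result = arr.view()

print(result is arr)                     # False: the identity check misses it
print(np.may_share_memory(arr, result))  # True: the data is shared

# With the identity-only check no copy is made, so a write goes through
# to the input array.
result[0] = "changed"
print(arr[0])                            # changed


def copy_if_aliased(arr, result):
    """Hypothetical helper using the same condition as the fixed check."""
    if result is arr or np.may_share_memory(arr, result):
        return result.copy()
    return result


# With the extended check the write lands in a private copy instead.
safe = copy_if_aliased(arr, arr.view())
safe[1] = "False"
print(arr[1] is False)                   # True: the input is untouched
```

np.may_share_memory only compares memory bounds, so it can report overlap where none exists; for a guard like this, the worst case is an unnecessary copy.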

    If you're using a version < 2.2 and cannot upgrade, you could try manually applying the fix mentioned in the PR and recompiling ".../pandas/_libs/lib.pyx".

    At #L759:

        if copy and result is arr:
            result = result.copy()
    

    Required change:

        if copy and (result is arr or np.may_share_memory(arr, result)):
            result = result.copy()
    

    There are now some extra comments in ".../pandas/_libs/lib.pyx", version 2.3.x, together with adjusted logic. See #L777-L785:

        if result is arr or np.may_share_memory(arr, result):
            # if np.asarray(..) did not make a copy of the input arr, we still need
            #  to do that to avoid mutating the input array
            # GH#54654: share_memory check is needed for rare cases where np.asarray
            #  returns a new object without making a copy of the actual data
            if copy:
                result = result.copy()
            else:
                already_copied = False
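
If neither upgrading nor recompiling is an option, one possible workaround is to cast a deep copy of the frame, so that any in-place modification hits the copy rather than the original. This is a sketch based on the description of the bug above, not something suggested in the PR; the helper name astype_str_defensive is hypothetical, and the extra copy costs memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["hi"] * 2 + [False] * 2 + [np.nan] * 2})


def astype_str_defensive(frame):
    """Hypothetical helper: cast a private deep copy of the frame.

    Even if astype(str) were to modify its input in place, only the
    temporary copy would be affected, not the caller's frame.
    """
    return frame.copy(deep=True).astype(str)


types_before = {type(v) for v in df["col1"].to_numpy()}
as_str = astype_str_defensive(df)
types_after = {type(v) for v in df["col1"].to_numpy()}

# The element types of the original column are unchanged.
print(types_before == types_after)
```

The same call should work on a frame returned by pd.read_pickle, which is the case the question is about.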