
Deep copy of Pandas dataframes and dictionaries


I'm creating a small Pandas dataframe:

import pandas as pd
df = pd.DataFrame(data={'colA': [["a", "b", "c"]]})

I take a deepcopy of that df. I'm not using the Pandas method but general Python, right?

import copy
df_copy = copy.deepcopy(df)

A df_copy.head() gives the following:

        colA
0  [a, b, c]

Then I put these values into a dictionary:

mydict = df_copy.to_dict()

That dictionary looks like this:

{'colA': {0: ['a', 'b', 'c']}}

Finally, I remove one item of the list:

mydict['colA'][0].remove("b")

I'm surprised that the values in df_copy are updated. I'm even more confused that the values in the original dataframe are updated too! Both dataframes look like this now:

     colA
0  [a, c]

I understand Pandas doesn't really do deepcopy, but this wasn't a Pandas method. My questions are:

1) How can I build a dictionary from a dataframe so that changing the dictionary doesn't update the dataframe?

2) How can I take a copy of a dataframe that is completely independent of the original?

Thanks for your help!

Cheers, Nicolas


Solution

    TL;DR

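    To get a truly deep copy:

    df_copy = pd.DataFrame(
        data=copy.deepcopy(df.values),  # also copies the Python objects stored in the cells
        index=df.index,                 # keep the original row labels
        columns=df.columns,
    )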
    

    Disclaimer


    Note that putting mutable objects inside a DataFrame can be an antipattern, so make sure you really need it and understand what you are doing.

    Why your copy is not independent


    When copy.deepcopy is applied to an object, it first looks for a __deepcopy__ method on that object and, if one exists, calls it instead of doing its generic recursive copy. This hook exists so that a class can avoid copying more than it needs to. For a DataFrame in pandas 0.20.0 and above, __deepcopy__ does not work recursively: the Python objects stored in the cells are not copied, only the references to them.

    Similarly, DataFrame.copy(deep=True) copies the data, but not recursively, so lists stored in the cells are still shared with the original.
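The dispatch described above can be sketched without pandas at all. `Table` below is a hypothetical stand-in, not a pandas class; it only assumes a `__deepcopy__` that copies the outer container and keeps references to the cell objects, which is roughly the behaviour the answer attributes to DataFrame.

```python
import copy


class Table:
    """Hypothetical container whose deep copy is only one level deep."""

    def __init__(self, cells):
        self.cells = cells  # a list of Python objects, e.g. lists

    def __deepcopy__(self, memo):
        # Copy the outer list, but keep references to the same cell objects.
        return Table(list(self.cells))


original = Table([["a", "b", "c"]])
duplicate = copy.deepcopy(original)  # dispatches to Table.__deepcopy__

duplicate.cells[0].remove("b")
print(original.cells)  # [['a', 'c']]: the inner list is shared
```

Because copy.deepcopy hands control to `__deepcopy__`, it never gets the chance to recurse into the inner list itself.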

    How to solve the problem


    To take a truly deep copy of a DataFrame containing lists (or other Python objects), so that the copy is fully independent, you can use one of the methods below.

    df_copy = pd.DataFrame(
        data=copy.deepcopy(df.values),  # also copies the Python objects stored in the cells
        index=df.index,                 # keep the original row labels
        columns=df.columns,
    )
    

    For a dictionary, you can use the same trick:

    mydict = pd.DataFrame(
        data=copy.deepcopy(df_copy.values),
        index=df_copy.index,
        columns=df_copy.columns,
    ).to_dict()
    mydict['colA'][0].remove("b")  # df_copy is left untouched
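A possibly simpler variant, not part of the original answer, is to deep-copy the dictionary itself rather than the DataFrame. The result of to_dict() is a plain nested dict, so copy.deepcopy should recurse into it normally. This is a minimal sketch that assumes the default to_dict() orientation used in the question.

```python
import copy

import pandas as pd

df = pd.DataFrame(data={'colA': [["a", "b", "c"]]})

# to_dict() returns plain dicts that still reference the DataFrame's lists;
# deep-copying that plain dict duplicates the lists as well.
mydict = copy.deepcopy(df.to_dict())
mydict['colA'][0].remove("b")

print(mydict)               # {'colA': {0: ['a', 'c']}}
print(df['colA'].iloc[0])   # ['a', 'b', 'c']
```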
    

    There's also a standard, if hacky, way of deep-copying Python objects, which is a round trip through pickle:

    import pickle
    df_copy = pickle.loads(pickle.dumps(df))  
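To check whether a given copy is really independent, one option is to compare the identity of the mutable cells in both frames. `shares_cells` below is a hypothetical helper written for this illustration, not a pandas function, and it only looks at lists, dicts and sets.

```python
import copy
import pickle

import pandas as pd


def shares_cells(left, right):
    """Return True if any mutable cell of `left` is the same object as in `right`."""
    return any(
        isinstance(a, (list, dict, set)) and a is b
        for col in left.columns
        for a, b in zip(left[col], right[col])
    )


df = pd.DataFrame(data={'colA': [["a", "b", "c"]]})

rebuilt = pd.DataFrame(
    data=copy.deepcopy(df.values), index=df.index, columns=df.columns
)
pickled = pickle.loads(pickle.dumps(df))

print(shares_cells(df, rebuilt))            # False: the lists were duplicated
print(shares_cells(df, pickled))            # False: pickle rebuilt the lists
print(shares_cells(df, copy.deepcopy(df)))  # the answer above predicts True here
```

A check like this can be run against whichever pandas version is installed, since the copy semantics described above are version-dependent.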
    

    Feel free to ask for any clarifications, if needed.