python pandas chained-assignment

Why is blindly using df.copy() a bad idea to fix the SettingWithCopyWarning?


There are countless questions about the dreaded SettingWithCopyWarning.

I've got a good handle on how it comes about. (Notice I said good, not great.)

It happens when a dataframe is derived from another dataframe (here d1 from df) and keeps a weak reference to it in its is_copy attribute.

Here's an example:

import pandas as pd

df = pd.DataFrame([[1]])

d1 = df[:]

d1.is_copy

<weakref at 0x1115a4188; to 'DataFrame' at 0x1119bb0f0>

We can either set that attribute to None or reassign a copy:

d1 = d1.copy()

I've seen devs like @Jeff, and others I can't remember, warn against doing that, citing that the SettingWithCopyWarning has a purpose.

Question
OK, so what is a concrete example that demonstrates why ignoring the warning by assigning a copy back to the original is a bad idea?

I'll define "bad idea" for clarification.

Bad Idea
It is a bad idea to place code into production that will lead to getting a phone call in the middle of a Saturday night saying your code is broken and needs to be fixed.

Now, how can using df = df.copy() to bypass the SettingWithCopyWarning lead to that kind of phone call? I want it spelled out, because this is a source of confusion and I'm trying to find clarity. I want to see the edge case that blows up!


Solution

  • Here are my 2 cents on this, with a very simple example of why the warning is important.

    Assume I create a df such as:

    import pandas as pd

    x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
    print(x)
       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    

    Now I want to create a new object from a subset of the original and modify it, such as:

    q = x.loc[:, 'a']
    

    Now this is a slice of the original, and (on pandas setups where Copy-on-Write is not enabled) whatever I do to it will affect x:

    q += 2
    print(x)  # checking x again, wow! it changed!
       a  b
    0  2  0
    1  3  1
    2  4  2
    3  5  3
    

    This is what the warning is telling you: you are working on a slice, so everything you do to it will be reflected in the original DataFrame.

    Now, with .copy(), q is no longer a slice of the original, so an operation on q won't affect x:

    x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
    print(x)
       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    
    q = x.loc[:, 'a'].copy()
    q += 2
    print(x)  # oh, x did not change because q is a copy now
       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    

    And by the way, a copy just means that q is a new object in memory, whereas a slice shares memory with the original object.

    IMO, using .copy() is very safe. As an example, df.loc[:, 'a'] returns a slice but df.loc[df.index, 'a'] returns a copy. Jeff told me that this was unexpected behavior and that : and df.index should behave the same as an indexer in .loc[]. Calling .copy() on either one returns a copy, so better to be safe: use .copy() if you don't want to affect the original dataframe.

    By default, .copy() returns a deep copy of the DataFrame, which is a very safe way to avoid the phone call you are talking about.

    But setting df.is_copy = None is just a trick that does not copy anything, which is a very bad idea: you will still be working on a slice of the original DataFrame.
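The defensive use of .copy() described above can be sketched as follows. This is only an illustration: add_doubled and the column name 'c' are hypothetical names made up for the example, not part of pandas.

```python
import pandas as pd


def add_doubled(df):
    """Hypothetical helper: keep rows with a > 1 and add a derived column."""
    subset = df[df['a'] > 1].copy()   # explicit copy, independent of df
    subset['c'] = subset['a'] * 2     # writes only to the copy
    return subset


x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
result = add_doubled(x)

print(result)  # the filtered rows with the new column 'c'
print(x)       # the caller's frame still has only columns 'a' and 'b'
```

Because the function copies before it writes, the caller's frame is never modified, whichever indexing path pandas happens to take internally.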

    One more thing that people tend not to know:

    df[columns] may return a view.

    df.loc[indexer, columns] also may return a view, but almost always does not in practice. Emphasis on the "may" here.
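Since "may return a view" is hard to reason about, one way to find out what a given pandas version actually does is to probe for shared memory. This is a minimal sketch, assuming NumPy is installed alongside pandas; shares_memory_with is a hypothetical helper name. The results for the plain indexing forms depend on the pandas version and settings, so no output is claimed for them. Only the explicit .copy() is expected never to share memory.

```python
import numpy as np
import pandas as pd


def shares_memory_with(parent_col, child):
    """Hypothetical helper: True if child reuses parent_col's data buffer."""
    return np.shares_memory(parent_col.to_numpy(), child.to_numpy())


x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])

# Different ways of getting column 'a' out of x.
candidates = {
    "x['a']": x['a'],
    "x.loc[:, 'a']": x.loc[:, 'a'],
    "x.loc[x.index, 'a']": x.loc[x.index, 'a'],
    "x.loc[:, 'a'].copy()": x.loc[:, 'a'].copy(),
}

results = {
    label: shares_memory_with(x['a'], obj)
    for label, obj in candidates.items()
}

for label, shared in results.items():
    print(f"{label:<22} shares memory with x: {shared}")
```

Sharing memory only shows that the data buffer is reused. Whether a write through the derived object reaches x also depends on whether Copy-on-Write is enabled.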