There are countless questions about the dreaded SettingWithCopyWarning.
I've got a good handle on how it comes about. (Notice I said good, not great.)
It happens when a dataframe df is "attached" to another dataframe via an attribute stored in is_copy.
Here's an example:
df = pd.DataFrame([[1]])
d1 = df[:]
d1.is_copy
<weakref at 0x1115a4188; to 'DataFrame' at 0x1119bb0f0>
We can either set that attribute to None or reassign a copy:
d1 = d1.copy()
I've seen devs like @Jeff, and I can't remember who else, warn about doing that, citing that the SettingWithCopyWarning has a purpose.
Question
OK, so what is a concrete example that demonstrates why ignoring the warning by assigning a copy back to the original is a bad idea?
I'll define "bad idea" for clarification.
Bad Idea
It is a bad idea to place code into production that will lead to getting a phone call in the middle of a Saturday night saying your code is broken and needs to be fixed.
Now, how can using df = df.copy() in order to bypass the SettingWithCopyWarning lead to getting that kind of phone call? I want it spelled out because this is a source of confusion and I'm attempting to find clarity. I want to see the edge case that blows up!
Here are my two cents on this, with a very simple example of why the warning is important.
Assume that I create a DataFrame such as:
x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
print(x)
   a  b
0  0  0
1  1  1
2  2  2
3  3  3
Now I want to create a new object (here a Series) based on a subset of the original and modify it, such as:
q = x.loc[:, 'a']
Now, this is a slice of the original, and whatever I do to it will affect x (at least when pandas is running without copy-on-write enabled):
q += 2
print(x) # checking x again, wow! it changed!
   a  b
0  2  0
1  3  1
2  4  2
3  5  3
This is what the warning is telling you: you are working on a slice, so everything you do to it will be reflected in the original DataFrame.
Now, using .copy(), it won't be a slice of the original, so doing an operation on q won't affect x:
x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
print(x)
   a  b
0  0  0
1  1  1
2  2  2
3  3  3
q = x.loc[:, 'a'].copy()
q += 2
print(x) # oh, x did not change because q is a copy now
   a  b
0  0  0
1  1  1
2  2  2
3  3  3
And by the way, a copy just means that q will be a new object in memory, whereas a slice shares the same memory as the original object.
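One way to see the "new object in memory" point for yourself is to ask NumPy whether two objects share a buffer. This is only a rough sketch: it assumes plain NumPy-backed integer columns, the variable names are made up for illustration, and whether the plain slice shares memory depends on your pandas version and settings, so no result is claimed for it.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({"a": range(4), "b": range(4)})

q_slice = x.loc[:, "a"]          # may or may not share memory with x
q_copy = x.loc[:, "a"].copy()    # explicit copy: gets its own buffer

column = x["a"].to_numpy()       # the data behind column 'a' of x

# np.shares_memory reports whether two arrays overlap in memory.
slice_shares = np.shares_memory(q_slice.to_numpy(), column)
copy_shares = np.shares_memory(q_copy.to_numpy(), column)

print("slice shares memory with x:", slice_shares)  # version dependent
print("copy shares memory with x:", copy_shares)    # False
```

The copy never shares memory with x, which is exactly why modifying it cannot leak back into the original.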
IMO, using .copy() is very safe. As an example, df.loc[:, 'a'] returns a slice but df.loc[df.index, 'a'] returns a copy. Jeff told me that this was unexpected behavior and that : or df.index should behave the same way as an indexer in .loc[]. Either way, calling .copy() on both will return a copy, so better to be safe: use .copy() if you don't want to affect the original dataframe.
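In practice, "better be safe" often means copying at the point where you take the subset, so it no longer matters which indexer you used. Here is a minimal sketch of that habit; add_offset is a hypothetical helper written for this answer, not something pandas provides.

```python
import pandas as pd


def add_offset(frame: pd.DataFrame, column: str, offset: int) -> pd.Series:
    """Return frame[column] + offset without touching the caller's frame."""
    values = frame.loc[:, column].copy()  # explicit copy, independent of `frame`
    values += offset                      # in-place, but only on the copy
    return values


x = pd.DataFrame({"a": range(4), "b": range(4)})
q = add_offset(x, "a", 2)

print(q.tolist())       # [2, 3, 4, 5]
print(x["a"].tolist())  # [0, 1, 2, 3]
```

Because the copy is made inside the function, the caller's x stays intact whether the .loc call happened to return a view or a copy.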
Now, .copy() returns a deep copy of the DataFrame, which is a very safe way to avoid the phone call you are talking about.
But using df.is_copy = None is just a trick that does not copy anything, which is a very bad idea: you will still be working on a slice of the original DataFrame.
One more thing that people tend not to know:
df[columns] may return a view.
df.loc[indexer, columns] also may return a view, but almost always does not in practice.
Emphasis on the may here.
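Since the answer is only ever "may", you can check at runtime instead of guessing. The sketch below uses a hypothetical helper, shares_data_with, and assumes NumPy-backed columns (for other dtypes, to_numpy() can itself copy, which would make the check meaningless). The results for the two plain indexing forms are deliberately not stated because they vary across pandas versions and settings.

```python
import numpy as np
import pandas as pd


def shares_data_with(child: pd.DataFrame, parent: pd.DataFrame) -> bool:
    """Hypothetical helper: True if any common column shares memory."""
    return any(
        np.shares_memory(child[col].to_numpy(), parent[col].to_numpy())
        for col in child.columns
        if col in parent.columns
    )


df = pd.DataFrame({"a": range(4), "b": range(4)})
columns = ["a", "b"]
indexer = df.index

candidates = {
    "df[columns]": df[columns],
    "df.loc[indexer, columns]": df.loc[indexer, columns],
    "df[columns].copy()": df[columns].copy(),
}

# Report, rather than assume, which results still point at df's data.
for label, obj in candidates.items():
    print(label, "->", shares_data_with(obj, df))
```

Only the last line is predictable: the explicit .copy() never shares data with df.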