python pandas dataframe pandas-settingwithcopy-warning

Why do I get a SettingWithCopyWarning when using shift and dropna inside a function?


In general, when I receive this warning

/home/mo/mwe.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead

I head to the second answer on How to deal with SettingWithCopyWarning in Pandas and try to reduce my code to one of the presented examples.

This time, however, I'm stuck with this not-so-minimal MWE:

import pandas as pd

def edit_df(df):
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    df = df.dropna()
    df["match"] = df["diff"] > 1
    return df

def main():
    df = pd.DataFrame({'base': [1,2]})
    df = edit_df(df)
    print(df)

main()

I tried to minimize it further, but the warning disappears when I remove any of the blocks or paste the function's code into main. Hence, my assumption is that the warning is caused by a combination of the operations, but it eludes me why this would be the case. From this question, I assume I'm always working on the original dataframe, as intended.

From my understanding, I probably slice somewhere, so I tried to place loc[:, 'column_name'] everywhere I assume slicing happens (edit_df_loc below), as the docs suggest. It performs the modifications, but it still shows the warning.

Using dropna().copy() or dropna(inplace=True) makes the warning disappear. But I don't know why I'd want to copy the dataframe (do I have to?), and inplace shouldn't be used.

Why do I face the warning and how do I properly fix it?


pandas version 2.3.3

I could well be missing terminology, so pointing me to a duplicate-target that explains the situation is also highly appreciated.

For reference, here are some of the variations that don't produce a warning, along with my attempt to use loc[]. I'm constructing a new dataframe every time, so there shouldn't be any slice upstream, as suggested here.

import pandas as pd

def edit_df(df):
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    df = df.dropna()
    df["match"] = df["base"] > 1
    return df

def edit_df1(df):
    df = df.dropna()
    df["match"] = df["base"] > 1
    return df

def edit_df2(df):
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    df = df.dropna()
    return df

def edit_df3(df):
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    df["match"] = df["base"] > 1
    return df

def edit_df_copy(df):
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    df = df.dropna().copy()
    df["match"] = df["base"] > 1
    return df

def edit_df_loc(df):
    df.loc[:, "shifted"] = df.loc[:, "base"].shift(-1)
    df.loc[:, "diff"] = df.loc[:, "shifted"] - df.loc[:, "base"]
    df = df.dropna()
    df.loc[:, "match"] = df.loc[:, "base"] > 1
    return df

def main():
    df = pd.DataFrame({'base': [1,2]})
    df = edit_df_copy(df)
    df = pd.DataFrame({'base': [1,2]})
    df = edit_df1(df)
    df = pd.DataFrame({'base': [1,2]})
    df = edit_df2(df)
    df = pd.DataFrame({'base': [1,2]})
    df = edit_df3(df)
    print(df)

main()


Solution

  • When a DataFrame is created, pandas sets up a BlockManager that organizes the data into homogeneous blocks (NumpyBlocks), each wrapping a single NumPy array of one dtype that holds one or more columns. This allows efficient access, column alignment, and internal memory management.

    When dropna is used (and there are NaN values in the DataFrame), Pandas builds a new DataFrame by copying only the valid rows into new blocks, creating an independent object from the original DataFrame.

    However, pandas still records where the result came from: its _is_copy attribute holds a weak reference (weakref) to the original DataFrame. This weakref is only bookkeeping. It does not keep the original alive and does not affect the independence of the copied data.

    When you assign to a DataFrame that carries such a weakref while the original DataFrame is still alive, a SettingWithCopyWarning may appear because pandas cannot be sure whether the object being modified is a view or a copy. This mechanism helps prevent ambiguous modifications that could affect the original DataFrame.

    A slightly modified version of your example:

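    import pandas as pd

    def log(msg: str, df: pd.DataFrame):
        # Print the frame's identity, its copy/view flags, and each block's data address
        print(f"{msg} {id(df):#0x}")
        # _is_copy is None or a weakref to the frame this one was derived from
        print(f"Is a copy: {df._is_copy if df._is_copy else False}")
        print(f"Is a view: {df._is_view}")
        for blk in df._mgr.blocks:
            # Address of the block's underlying NumPy buffer
            print(blk, hex(blk.values.__array_interface__["data"][0]))
        print()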
    
    df = pd.DataFrame({'base': [1, 2, 3]})
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    log("Original", df)
    
    df1 = df.dropna()
    log("Dropna", df1)
    
    # del df  # <- uncommenting this releases the original, so the weakref goes dead
    df1.loc[:, "match"] = df1.loc[:, "base"] > 1
    log("Edit", df1)
    

    Output:

    Original 0x72fa2b1fb150
    Is a copy: False
    Is a view: False
    NumpyBlock: slice(0, 1, 1), 1 x 3, dtype: int64 0x625fad4f6180
    NumpyBlock: slice(1, 2, 1), 1 x 3, dtype: float64 0x625facea21b0
    NumpyBlock: slice(2, 3, 1), 1 x 3, dtype: float64 0x625facd26be0
    
    Dropna 0x72fa283cd7d0
    Is a copy: <weakref at 0x72fa2a776020; to 'DataFrame' at 0x72fa2b1fb150>
    Is a view: False
    NumpyBlock: slice(0, 1, 1), 1 x 2, dtype: int64 0x625facae34c0
    NumpyBlock: slice(1, 2, 1), 1 x 2, dtype: float64 0x625fad4f10a0
    NumpyBlock: slice(2, 3, 1), 1 x 2, dtype: float64 0x625fac8141c0
    
    <ipython-input-209-8245cb725357>:18: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      df1.loc[:, "match"] = df1.loc[:, "base"] > 1
    
    Edit 0x72fa283cd7d0
    Is a copy: <weakref at 0x72fa2a776020; to 'DataFrame' at 0x72fa2b1fb150>
    Is a view: False
    NumpyBlock: slice(0, 1, 1), 1 x 2, dtype: int64 0x625facae34c0
    NumpyBlock: slice(1, 2, 1), 1 x 2, dtype: float64 0x625fad4f10a0
    NumpyBlock: slice(2, 3, 1), 1 x 2, dtype: float64 0x625fac8141c0
    NumpyBlock: slice(3, 4, 1), 1 x 2, dtype: bool 0x625fac4807d0
    

    As you can see, dropna recreates all the NumpyBlocks, so the resulting DataFrame is independent of the original, yet a weakref to the original is still attached. When you then add a column via .loc, the existing NumpyBlocks are reused (only a new bool block is appended), and the attached weakref is what triggers the SettingWithCopyWarning: pandas treats the object as a possible copy of a slice, even though it is a genuine copy here.
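The role of the weakref can be sketched by running the same steps twice: once while a second name still refers to the original frame (as happens when the caller of your function holds it), and once where rebinding releases it (as happens when the code is pasted into main). The helper names build, original_kept_alive, original_released and run_and_count are hypothetical and only serve this illustration. On pandas 2.x with default settings I would expect the first case to warn and the second not to, but the counts depend on the pandas version, so treat this as a sketch rather than a guarantee.

```python
import warnings

import pandas as pd


def build():
    """Recreate the frame from the example; the last row ends up with NaN."""
    df = pd.DataFrame({"base": [1, 2, 3]})
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    return df


def original_kept_alive():
    df = build()
    dropped = df.dropna()  # `df` still refers to the original frame
    dropped["match"] = dropped["base"] > 1
    return dropped


def original_released():
    df = build()
    df = df.dropna()  # rebinding releases the last reference to the original
    df["match"] = df["base"] > 1
    return df


def run_and_count(func):
    """Hypothetical helper: run func, return (result, SettingWithCopyWarning count)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func()
    # Match by class name so the sketch also runs where the warning class is absent
    hits = [w for w in caught if w.category.__name__ == "SettingWithCopyWarning"]
    return result, len(hits)


kept, kept_hits = run_and_count(original_kept_alive)
released, released_hits = run_and_count(original_released)
print("original kept alive:", kept_hits)
print("original released:  ", released_hits)
```

Both variants produce the same data. Only the bookkeeping differs, which is why the warning is best read as a hint to check your intent rather than as proof of a bug.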

    This behavior is expected to change with pandas 3.0, where Copy-on-Write becomes the default.
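As for a fix, here is a minimal sketch of two options, assuming pandas 2.x. edit_df_copy is the variant from your question: the explicit .copy() hands you a frame that no longer tracks the original, so the later assignment is unambiguous. edit_df_assign is my own assumption of an alternative, not something taken from the question: assign returns a new frame with the extra column instead of setting it on the result of dropna.

```python
import pandas as pd


def edit_df_copy(df):
    """Variant from the question: copy after dropna, then add the column."""
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    df = df.dropna().copy()  # explicit copy: the result no longer tracks the original
    df["match"] = df["diff"] > 1
    return df


def edit_df_assign(df):
    """Assumed alternative: assign() returns a new frame with the added column."""
    df["shifted"] = df["base"].shift(-1)
    df["diff"] = df["shifted"] - df["base"]
    return df.dropna().assign(match=lambda d: d["diff"] > 1)


a = edit_df_copy(pd.DataFrame({"base": [1, 2]}))
b = edit_df_assign(pd.DataFrame({"base": [1, 2]}))
print(a.equals(b))
print(a)
```

Either way you are stating explicitly that you want a new, independent frame, which is the information pandas is missing when it emits the warning.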