python pandas dataframe

Pandas API behaviour when CopyOnWrite is enabled


I am new to pandas and am trying to understand its API design. What I would most like is a good rule of thumb for predicting whether calling a method on a DataFrame returns a new copy (which I must assign to a variable) or modifies it in place.

The documentation mentions throughout that Copy-on-Write will become the default, so I have enabled it by setting pd.options.mode.copy_on_write = True, and I am only interested in the behaviour with Copy-on-Write active.

Here is an example of the transformations I need to apply to a data set loaded from an Excel sheet. Although the snippet below seems to do what I need, I always have to reassign the DataFrame returned by each method to the variable df.

df = pd.read_excel("my_excel_file.xls", sheet_name="my_sheet", usecols="A:N")  # load dataframe from excel sheet
df = df.dropna(how='all')            # remove empty rows
df = df.iloc[:-1,:]                  # remove last row
df.columns.array[0] = "Resource"     # change name of the first column
df = df.astype({"Resource": int})    # change column type
df.columns = df.columns.str.replace('Avg of ', '').str.replace('Others', 'others')  # alter column names
df = df.set_index("Resource")        # use 'Resource' column as index
df = df.sort_index(axis=0)           # sort df by index value
df = df / 100                        # divide each entry by 100
df = df.round(4)                     # round to 4 decimals
df = df.reindex(columns=sorted(df))  # order columns in ascending alphabetical order

What is the recommended way to carry out the operations in the snippet above? Is it correct to assume that each method that modifies the DataFrame is not applied in place and instead returns a new DataFrame object that I need to assign to a variable? More generally, is reassigning the variable df after each step the recommended way to use the pandas API?


Solution

  • With a few minor modifications (and corrections), so that every step uses a method that returns a DataFrame, you can chain the methods instead of reassigning the variable after each step. The chain is clear and concise, and any single operation can be commented out for testing. Your code becomes:

    df = (df
        .dropna(how='all')                           # remove empty rows
        .iloc[:-1, :]                                # remove last row
        .rename(columns={df.columns[0]: "Resource"}) # change name of the first column
        .astype({"Resource": int})                   # change column type
        .rename(columns=lambda s: s.replace('Avg of ', '').replace('Others', 'others'))  # alter column names
        .set_index("Resource")                       # use 'Resource' column as index
        .sort_index(axis=0)                          # sort df by index value
        .div(100)                                    # divide each entry by 100
        .round(4)                                    # round to 4 decimals
        .sort_index(axis=1)                          # order columns in ascending alphabetical order
        )
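
    As a minimal sketch of how this chain behaves with Copy-on-Write enabled, the example below runs the same steps on a small in-memory DataFrame. The frame raw, its column names and values, and the helper tidy are made up for illustration and stand in for the Excel sheet; the sketch assumes a pandas version in which the copy_on_write option exists (2.x).

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # same setting as in the question

# Hypothetical stand-in for the Excel sheet: an unnamed first column,
# an empty row, and a trailing summary row that should be dropped.
raw = pd.DataFrame(
    {
        "Unnamed: 0": [2.0, 1.0, None, 99.0],
        "Avg of B": [50.0, 25.0, None, 0.0],
        "Avg of A": [12.345678, 100.0, None, 0.0],
        "Others": [10.0, 20.0, None, 0.0],
    }
)


def tidy(df):
    """Apply the chained steps from the answer; every step returns a new object."""
    return (
        df
        .dropna(how="all")                            # remove empty rows
        .iloc[:-1, :]                                 # remove last row
        .rename(columns={df.columns[0]: "Resource"})  # name the first column
        .astype({"Resource": int})                    # change column type
        .rename(columns=lambda s: s.replace("Avg of ", "").replace("Others", "others"))
        .set_index("Resource")                        # use 'Resource' as index
        .sort_index(axis=0)                           # sort rows by index value
        .div(100)                                     # divide each entry by 100
        .round(4)                                     # round to 4 decimals
        .sort_index(axis=1)                           # sort columns by name
    )


clean = tidy(raw)

# The input is untouched: the chain only produced new objects.
raw_shape_after = raw.shape
raw_first_column_after = raw.columns[0]

# With Copy-on-Write, writing to an object derived from raw does not
# write through to raw itself.
others = raw["Others"]
others.iloc[0] = -1.0
raw_others_first = raw.loc[0, "Others"]
```

    The rule of thumb this suggests: methods such as dropna, rename, astype or sort_index return a new object and leave the caller unchanged (unless inplace=True is passed to a method that still accepts it), whereas direct assignments such as df.columns = ... or df.loc[...] = ... modify the object they are applied to.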