python dataframe for-loop

Using for-loops to process multiple pandas dataframes


I'm processing many dataframes and want to remove null and negative values using a for-loop. The Python code below seems like it should work, but it doesn't. Why doesn't this logic work in Python?

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'depth': [-1, 2, 3, 4, np.nan], 'temp': [-1,2,3,4,5]})
df2 = pd.DataFrame({'depth': [1, 2, 3, 4, 5],       'temp': [-1,2,3,4,5]})
df3 = pd.DataFrame({'depth': [1, 2, 3, 4, 5],       'temp': [-1,2,3,4,np.nan]})

df_names=(df1, df2, df3)

for i in df_names:
    i = i.dropna()
    i = i[i['temp']>0]
    i = i[i['depth']>0]

print(df1, '\n', df2,'\n', df3)



Solution

  • Your code doesn't work because the assignment inside the for loop rebinds the loop variable i to a new DataFrame on each iteration. The original dataframes are never modified.

    This is how it works:

    1. Let's check where our initial variables are stored:
    df1 = pd.DataFrame({'depth': [-1, 2, 3, 4, np.nan], 'temp': [-1,2,3,4,5]})
    df2 = pd.DataFrame({'depth': [1, 2, 3, 4, 5],       'temp': [-1,2,3,4,5]})
    df3 = pd.DataFrame({'depth': [1, 2, 3, 4, 5],       'temp': [-1,2,3,4,np.nan]})
    
    df_names=[df1, df2, df3]
    print(id(df1), id(df_names[0]))
    
    2495042353424 2495042353424
    

    Great, df1 and the 0-indexed element of the list are the same object in memory (which makes sense).

    2. Then let's run a for loop with an assignment inside it and check whether we are still operating on the same object (I simplified your code a bit):
    for i in df_names:
        print(id(i))
        i = i.dropna()
        print('after assignment:', id(i))
    
    2495042353424
    after assignment: 2495042514512
    2495042411984
    after assignment: 2495042516048
    2495006121552
    after assignment: 2495042354768
    

    Here you can see that after the assignment, i refers to a new object (at a different place in memory), and all the following operations (dropna and then the dataframe filtering) are applied to that new object. When each iteration of the for loop ends, this object is simply discarded, so your initial data is not affected at all.
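
The same effect can be reproduced without pandas. Below is a minimal sketch using plain Python lists as stand-ins for the dataframes (the data and the names items and after_rebinding are made up for illustration). It contrasts rebinding the loop variable with changing the existing object in place:

```python
# Hypothetical data: plain lists stand in for the dataframes
items = [[-1, 2, None], [1, 2, 3]]

# Rebinding: the comprehension builds a new list and `i` is pointed at it;
# the lists stored inside `items` are not touched
for i in items:
    i = [x for x in i if x is not None]

after_rebinding = [list(x) for x in items]
print(after_rebinding)   # [[-1, 2, None], [1, 2, 3]]

# In-place: slice assignment changes the contents of the existing list object
for i in items:
    i[:] = [x for x in i if x is not None]

print(items)             # [[-1, 2], [1, 2, 3]]
```

Only the second loop changes what items contains, which is the same distinction as i = i.dropna() versus an in-place dataframe method.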

    How can you fix it? In this particular example, I suggest simply avoiding the assignment and using the dataframe in-place methods:

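    for df in df_names:
        # drop rows containing NaN, modifying the original object
        df.dropna(inplace=True)
        # drop rows where temp is not positive
        df.drop(df[~(df['temp']>0)].index, inplace=True)
        # drop rows where depth is not positive
        df.drop(df[~(df['depth']>0)].index, inplace=True)
    print(df1)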
    
       depth  temp
    1    2.0     2
    2    3.0     3
    3    4.0     4
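
If you would rather not modify the dataframes in place, another option is to keep the assignment but store the results instead of throwing them away. This is only a sketch, not part of the answer above: the helper name clean is hypothetical, and it assumes every dataframe has temp and depth columns as in the question.

```python
import numpy as np
import pandas as pd


def clean(df):
    """Return a new DataFrame without NaN rows and with temp > 0 and depth > 0.

    `clean` is a hypothetical helper name used for illustration.
    """
    df = df.dropna()
    return df[(df['temp'] > 0) & (df['depth'] > 0)]


df1 = pd.DataFrame({'depth': [-1, 2, 3, 4, np.nan], 'temp': [-1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1, 2, 3, 4, 5]})
df3 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1, 2, 3, 4, np.nan]})

# Build the cleaned frames, then rebind the original names to them
df1, df2, df3 = [clean(df) for df in (df1, df2, df3)]

print(len(df1), len(df2), len(df3))  # 3 4 3
```

Here the new objects returned by clean are assigned back to df1, df2 and df3, so nothing is lost when the loop ends. With many dataframes, keeping them in a list or dict and replacing that container's contents works the same way.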