I'm processing many DataFrames and want to remove null and negative values using a for loop. The Python code below looks like it should work, but the DataFrames come out unchanged. Why doesn't this logic work in Python?
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'depth': [-1, 2, 3, 4, np.nan], 'temp': [-1,2,3,4,5]})
df2 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1,2,3,4,5]})
df3 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1,2,3,4,np.nan]})
df_names=(df1, df2, df3)
for i in df_names:
    i = i.dropna()
    i = i[i['temp']>0]
    i = i[i['depth']>0]
print(df1, '\n', df2,'\n', df3)
Your code doesn't work because you assign to the loop variable inside the for loop. Each assignment rebinds the name i to a newly created DataFrame; it does not change the original DataFrame that df1, df2, or df3 refers to.
This is how it works:
df1 = pd.DataFrame({'depth': [-1, 2, 3, 4, np.nan], 'temp': [-1,2,3,4,5]})
df2 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1,2,3,4,5]})
df3 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1,2,3,4,np.nan]})
df_names=[df1, df2, df3]
print(id(df1), id(df_names[0]))
2495042353424 2495042353424
Good: df1 and the element at index 0 of the list are the same object in memory (which makes sense).
for i in df_names:
    print(id(i))
    i = i.dropna()
    print('after assignment:', id(i))
2495042353424
after assignment: 2495042514512
2495042411984
after assignment: 2495042516048
2495006121552
after assignment: 2495042354768
Here you can see that after the assignment a new temporary DataFrame was created (at a different place in memory), and all the following operations (dropna and then the filtering) are applied to that new object. Only the name i refers to it, so once i is rebound on the next iteration the object is discarded, and your original data is never affected.
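The same rebinding behaviour can be reproduced without pandas. The sketch below uses plain lists (the names a, b, and items are made up for illustration): assigning to the loop variable leaves the originals untouched, while mutating the object through the loop variable changes them.

```python
a = [3, -1, 2]
b = [-5, 4]
items = [a, b]

# Rebinding: `x` is pointed at a brand-new list; `a` and `b` are untouched.
for x in items:
    x = [v for v in x if v > 0]

print(a, b)  # [3, -1, 2] [-5, 4]

# In-place mutation: slice assignment changes the list object itself,
# so the change is visible through `a` and `b` as well.
for x in items:
    x[:] = [v for v in x if v > 0]

print(a, b)  # [3, 2] [4]
```

The second loop plays the same role for lists that the inplace=True methods play for DataFrames.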
How can you fix it? In this particular example, I suggest avoiding the assignment and using the DataFrame in-place methods instead:
for df in df_names:
    df.dropna(inplace=True)
    df.drop(df[~(df['temp']>0)].index, inplace=True)
    df.drop(df[~(df['depth']>0)].index, inplace=True)
print(df1)
   depth  temp
1    2.0     2
2    3.0     3
3    4.0     4
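If you would rather not modify the DataFrames in place, another option is to build new, filtered DataFrames and rebind the original names to them. This is only a sketch: the helper name clean is hypothetical, and it assumes every DataFrame has depth and temp columns as in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'depth': [-1, 2, 3, 4, np.nan], 'temp': [-1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1, 2, 3, 4, 5]})
df3 = pd.DataFrame({'depth': [1, 2, 3, 4, 5], 'temp': [-1, 2, 3, 4, np.nan]})


def clean(df):
    """Return a new DataFrame keeping only non-null rows with depth > 0 and temp > 0."""
    df = df.dropna()
    return df[(df['temp'] > 0) & (df['depth'] > 0)]


# The comprehension creates new objects; unpacking rebinds df1, df2, df3 to them.
df1, df2, df3 = [clean(df) for df in (df1, df2, df3)]

print(df1)
```

Here the assignment is intentional: it happens on the names df1, df2, and df3 themselves rather than on a loop variable, so the filtered results are kept.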