I am trying to remove duplicate data from my dataframe (read from a CSV) and get a separate CSV that shows the unique answers of each column. The problem is that my code has been running for a day (22 hours to be exact), so I'm open to other suggestions.
My data has about 20,000 rows with headers (example). I have previously checked the unique values column by column with df[col].unique(), and that does not take anywhere near as long; I show a quick check right after the code below.
import pandas as pd

df = pd.read_csv('Surveydata.csv')
# drop duplicate values within each column and compact the survivors to the top
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
df_uni.to_csv('Surveydata_unique.csv', index=False)
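By contrast, checking a single column of the same df, for example:

print(df['Rmoisture'].unique())

comes back almost immediately, which is why the runtime of the full script surprises me.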
What I expect is a dataframe that has the same set of columns but without any duplication within each field (example). For example, if df['Rmoisture'] contains a combination of Yes, No, and NaN, the corresponding column of the other dataframe df_uni should contain only those three values.
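To make the expected output concrete, here is a small made-up sample for one column (the real file has many more columns and rows):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Rmoisture': ['Yes', 'No', 'Yes', np.nan, 'No', np.nan]})
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
print(df_uni)
# Expected output:
#   Rmoisture
# 0       Yes
# 1        No
# 2       NaN

Only the three distinct values remain in that column of df_uni, and the same should happen independently for every other column.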
Another method:
# build one single-column frame of unique values per column, then concatenate them side by side
new_df = [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]
new_df = pd.concat(new_df, axis=1)
print(new_df)
    Mass     Length  Material  Special Mark  Special Num  Breaking  \
 0   4.0   5.500000      Wood             A         20.0       Yes
 1  12.0   2.600000     Steel           NaN          NaN        No
 2   1.0   3.500000    Rubber             B          5.5       NaN
 3  15.0   6.500000   Plastic             X          6.6       NaN
 4   6.0  12.000000       NaN           NaN          5.6       NaN
 5  14.0   2.500000       NaN           NaN          6.3       NaN
 6   2.0  15.000000       NaN           NaN          NaN       NaN
 7   8.0   2.000000       NaN           NaN          NaN       NaN
 8   7.0  10.000000       NaN           NaN          NaN       NaN
 9   9.0   2.200000       NaN           NaN          NaN       NaN
10  11.0   4.333333       NaN           NaN          NaN       NaN
11  13.0   4.666667       NaN           NaN          NaN       NaN
12   NaN   3.750000       NaN           NaN          NaN       NaN
13   NaN   1.666667       NaN           NaN          NaN       NaN

                  Comment
 0       There is no heat
 1                    NaN
 2      Contains moisture
 3  Hit the table instead
 4         A sign of wind
 5                    NaN
 6                    NaN
 7                    NaN
 8                    NaN
 9                    NaN
10                    NaN
11                    NaN
12                    NaN
13                    NaN