
Is there any faster alternative to col.drop_duplicates()?


I am trying to remove duplicate data from my dataframe (loaded from a CSV) and write a separate CSV that shows the unique values of each column. The problem is that my code has been running for almost a day (22 hours, to be exact). I'm open to other suggestions.

My data has about 20,000 rows with headers (example). Earlier I checked the unique values one column at a time with df[col].unique(), and that did not take long.

import pandas as pd
df = pd.read_csv('Surveydata.csv')
# Deduplicate each column independently and renumber its rows from 0
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
df_uni.to_csv('Surveydata_unique.csv', index=False)

What I expect is a dataframe with the same set of columns but without any duplicates within each column (example). E.g., if df['Rmoisture'] contains a mix of Yes, No and NaN, the same column of the new dataframe df_uni should contain only these 3 values.


Solution

  • Another method:

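    # One single-column frame of unique values per column, concatenated
    # side by side; shorter columns are padded with NaN.
    frames = [pd.DataFrame(df[col].unique(), columns=[col]) for col in df.columns]
    new_df = pd.concat(frames, axis=1)
    print(new_df)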
    
    
       Mass     Length  Material  Special Mark  Special Num  Breaking  \
    0    4.0   5.500000     Wood            A         20.0      Yes   
    1   12.0   2.600000    Steel          NaN          NaN       No   
    2    1.0   3.500000   Rubber            B          5.5      NaN   
    3   15.0   6.500000  Plastic            X          6.6      NaN   
    4    6.0  12.000000      NaN          NaN          5.6      NaN   
    5   14.0   2.500000      NaN          NaN          6.3      NaN   
    6    2.0  15.000000      NaN          NaN          NaN      NaN   
    7    8.0   2.000000      NaN          NaN          NaN      NaN   
    8    7.0  10.000000      NaN          NaN          NaN      NaN   
    9    9.0   2.200000      NaN          NaN          NaN      NaN   
    10  11.0   4.333333      NaN          NaN          NaN      NaN   
    11  13.0   4.666667      NaN          NaN          NaN      NaN   
    12   NaN   3.750000      NaN          NaN          NaN      NaN   
    13   NaN   1.666667      NaN          NaN          NaN      NaN   
    
                      Comment  
    0        There is no heat  
    1                     NaN  
    2       Contains moisture  
    3   Hit the table instead  
    4          A sign of wind  
    5                     NaN  
    6                     NaN  
    7                     NaN  
    8                     NaN  
    9                     NaN  
    10                    NaN  
    11                    NaN  
    12                    NaN  
    13                    NaN
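
  • A variant of the same idea, as a minimal self-contained sketch: wrap each column's unique() result in a pd.Series and pass the dict to the DataFrame constructor, which aligns the columns on their 0..n-1 index and pads the shorter ones with NaN. The toy data below is made up for illustration and only stands in for Surveydata.csv; the column names are borrowed from the question and the output above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for Surveydata.csv (values are invented for illustration).
df = pd.DataFrame({
    "Rmoisture": ["Yes", "No", np.nan, "Yes", "No"],
    "Mass": [4.0, 12.0, 4.0, 4.0, 12.0],
})

# Each pd.Series gets its own 0..n-1 index, so the DataFrame constructor
# aligns them row by row and fills the shorter columns with NaN.
df_uni = pd.DataFrame({col: pd.Series(df[col].unique()) for col in df.columns})

print(df_uni)

# To write the result out as in the question (file name taken from the question):
# df_uni.to_csv('Surveydata_unique.csv', index=False)
```

    Note that unique() keeps NaN as one of the unique values, so a NaN in a column can be either a genuine value from the data or padding added because another column has more unique values.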