python pandas drop-duplicates

How to drop_duplicates in python


I have to compare two CSV files, drop the duplicate rows, and generate another file.

import pandas as pd

# Here I'm comparing the CSV files: oldest_file and newest_file
different_data_type = newest_file.equals(other=oldest_file)
# If they have differences, I concat them to drop the rows that are equal
merged_files = pd.concat([oldest_file, newest_file])

merged_files = merged_files.drop_duplicates()
print(merged_files)

Each CSV file has about 5,000 rows, and when I print merged_files, I get all 10,000 rows back. In other words, nothing is being dropped.

How can I get only the rows that differ between the two files?


Solution

  • I think you need to specify which columns to compare in drop_duplicates(). Try something like:

    df.drop_duplicates(subset=['column1', 'column2'])
    

    Another way is to flag the duplicates in your merged file and then filter them out of merged_files:

    duplicate_rows = merged_files.duplicated(subset=['column1', 'column2'])
    merged_files = merged_files[~duplicate_rows]
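
Both snippets above keep the first copy of each duplicated row. If the goal is only the rows that differ between the two files, one possible approach is to pass keep=False, which drops every copy of a duplicated row. The sketch below is a minimal illustration, not the asker's actual data: the two small DataFrames and the column names id and value are made up to stand in for the CSV files. It also shows a merge with indicator=True as an alternative that records which file each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (column names are assumptions).
oldest_file = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
newest_file = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "x", "d"]})

merged_files = pd.concat([oldest_file, newest_file], ignore_index=True)

# keep=False removes every copy of a duplicated row, so only rows that
# appear once across the concatenated data remain.
only_differences = merged_files.drop_duplicates(keep=False)
print(only_differences)

# Alternative: an outer merge on all shared columns, with a "_merge" column
# saying whether a row is in the left file, the right file, or both.
labelled = oldest_file.merge(newest_file, how="outer", indicator=True)
changed = labelled[labelled["_merge"] != "both"]
print(changed)
```

Note that keep=False would also drop rows that are duplicated inside a single file. If all 10,000 rows still come back, the rows may not be exactly equal: values are compared across every column, so stray whitespace or a column parsed as a different dtype in one file makes otherwise identical rows look distinct.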