python, csv, bigdata

How to delete duplicates from one specific column in Python with 20 million rows


I would like to delete duplicates from one large CSV. The data has this format:

client_id;gender;age;profese;addr_cntry;NAZOKRESU;prijem_AVG_6M_pasmo;cont_id;main_prod_id;bal_actl_am_pasmo
388713248;F;80;důchodce;CZ;Czech;;5715125;39775;
27953927;M;28;Dělník;CZ;Opavia;22;4427292;39075;

I need to delete all duplicates based on client_id. I cannot handle this big file in Python with pandas. I tried Dask, but got the same result: an infinite wait and nothing really happened.

Here is my latest version of the code:


import dask.dataframe as dd
import chardet
from dask.diagnostics import ProgressBar

# detect the encoding (this reads the whole file into memory)
with open('bigData.csv', 'rb') as f:
    result = chardet.detect(f.read())

df = dd.read_csv('bigData.csv', encoding=result['encoding'], sep=';')

with ProgressBar():
    total_rows = df.shape[0].compute()

# keep=False drops every occurrence of a duplicated client_id, not just the extras;
# there is no 'Inplace' argument, dask dataframes are not modified in place
df = df.drop_duplicates(subset=['client_id'], keep=False)

df.to_csv('bigData.csv', sep=';', index=False)

with ProgressBar():
    total_duplicates = total_rows - df.shape[0].compute()

print(f'Deleted {total_duplicates} duplicated rows.')

I tried it with a progress bar and nothing really happened. Thanks for any help!


Solution

  • You might be able to get away with a very simple Python program that stores every new ID it sees in a dict and skips writing a subsequent row if it finds that row's ID already in the dict. It should require about 2GB of RAM.

    import csv
    
    reader = csv.reader(open("input.csv", newline=""))
    writer = csv.writer(open("output.csv", "w", newline=""))
    
    writer.writerow(next(reader))  # transfer header, if you have one
    
    ids = {}
    for row in reader:
        if row[0] not in ids:
            writer.writerow(row)
            ids[row[0]] = None  # add ID to "list" of already written IDs
    

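    If you want to apply this directly to your semicolon-delimited file, a minimal adaptation might look like the sketch below. The encoding and the output file name are assumptions on my part; substitute whatever chardet reported and whatever paths you prefer:

    import csv

    ENCODING = "utf-8"  # assumption: replace with the encoding chardet reported for your file

    # Same idea as above, adapted to the ';'-delimited file from the question.
    # A set works just as well as a dict for tracking already-written IDs.
    with open("bigData.csv", newline="", encoding=ENCODING) as src, \
         open("bigData_dedup.csv", "w", newline="", encoding=ENCODING) as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst, delimiter=";")

        writer.writerow(next(reader))  # copy the header row unchanged

        seen = set()
        for row in reader:
            if row[0] not in seen:  # row[0] is client_id
                writer.writerow(row)
                seen.add(row[0])

    Note that it writes the result to a new file instead of overwriting bigData.csv in place, since the input is still being read while the output is written.
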
    To test this approach, I mocked up a CSV with 20M rows (with randomly generated IDs between 0 and 20M) that looks something like this; a sketch for generating such a file follows the table:

    | id       | i |
    |----------|---|
    | 2266768  | 0 |
    | 15245359 | 1 |
    | 16304974 | 2 |
    | 4801643  | 3 |
    | 9612409  | 4 |
    | 17659151 | 5 |
    | 15824934 | 6 |
    | 4101873  | 7 |
    | 12282127 | 8 |
    | 5172219  | 9 |
    
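    For reference, a mock file like that can be generated with something like this illustrative sketch (not necessarily the exact script used):

    import csv
    import random

    # 20M rows with IDs drawn at random from 0..20M; sampling with replacement
    # like this leaves roughly 12.6M distinct IDs, matching the count reported below.
    with open("input.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "i"])
        for i in range(20_000_000):
            writer.writerow([random.randint(0, 20_000_000), i])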

    I ran it through that program and ended up with 12.6M rows. On my MacBook Air M1 (dual-channel SSD) that took 14 seconds and consumed 1.5GB of RAM; the RAM is needed to hold all the previously seen IDs.

    Also, I see you're reading the entire file first just to detect the character encoding: