I would like to delete duplicates from one large CSV file. The data has this format:
client_id;gender;age;profese;addr_cntry;NAZOKRESU;prijem_AVG_6M_pasmo;cont_id;main_prod_id;bal_actl_am_pasmo
388713248;F;80;důchodce;CZ;Czech;;5715125;39775;
27953927;M;28;Dělník;CZ;Opavia;22;4427292;39075;
I need to delete all rows with duplicate client_id values. I cannot handle this big file in Python with pandas. I tried Dask, but got the same result: an infinite wait and nothing really happening.
Here is my latest version of the code:
import dask.dataframe as dd
import chardet
from dask.diagnostics import ProgressBar
with open('bigData.csv', 'rb') as f:
    result = chardet.detect(f.read())
df = dd.read_csv('bigData.csv', encoding=result['encoding'], sep=';')
total_rows = df.shape[0].compute()
df = df.drop_duplicates(subset=['client_id'], keep=False, Inplace=True)
df.to_csv('bigData.csv', sep=';', index=False)
total_duplicates = total_rows - df.shape[0].compute()
print(f'Deleted {total_duplicates} duplicate rows.')
I tried it with a progress bar and nothing really happened. Thanks for the help!
You might be able to get away with a very simple Python program that stores every new ID it sees in a dict and skips writing a subsequent row if that row's ID is already in the dict. It should require about 2 GB of RAM.
import csv

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    # For the semicolon-separated file in the question, pass delimiter=";" to both
    # the reader and the writer.
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))  # transfer header, if you have one

    ids = {}
    for row in reader:
        if row[0] not in ids:
            writer.writerow(row)
            ids[row[0]] = None  # add ID to "list" of already written IDs
This approach uses a single dict, `ids`, to hold every ID the program has already encountered and written; dicts do very fast lookups/membership checks on their keys (your IDs).

I mocked up a CSV with 20M rows (with randomly generated IDs between 0 and 20M) that looks something like this (a sketch for generating a similar mock file follows the table):
| id | i |
|----------|---|
| 2266768 | 0 |
| 15245359 | 1 |
| 16304974 | 2 |
| 4801643 | 3 |
| 9612409 | 4 |
| 17659151 | 5 |
| 15824934 | 6 |
| 4101873 | 7 |
| 12282127 | 8 |
| 5172219 | 9 |
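This is not the exact script used for the mock, but a minimal sketch of how a file of that shape could be generated; the 20M row count and the 0–20M ID range come from the description above, and the column names are simply the ones shown in the sample:

import csv
import random

# Hypothetical mock-data generator: 20M rows with IDs drawn at random
# from 0..20M, plus a running counter in a second column.
with open("input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "i"])
    for i in range(20_000_000):
        writer.writerow([random.randint(0, 20_000_000), i])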
I ran it through that program and ended up with 12.6M rows. On my MacBook Air (M1, dual-channel SSD) that took 14 seconds and consumed 1.5 GB of RAM; the RAM is needed to hold all the previously seen IDs.
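If that memory footprint is a concern, one variation (not part of the timing above) is to store the IDs in a set as ints, which is usually leaner than a dict of strings when the IDs are purely numeric:

import csv

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # transfer header

    seen = set()
    for row in reader:
        key = int(row[0])  # assumes the ID column is always numeric, as in the question's data
        if key not in seen:
            writer.writerow(row)
            seen.add(key)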
Also, I see you are reading the entire file first just to detect the character encoding.
Have you tried running `chardetect` from the command line (`chardetect input.csv`) and just hard-coding the value it returns?

Have you experimented with reading a much smaller portion of the file to see what encoding and confidence you get?
with open("input.csv", "rb") as f:
input_enc = chardet.detect(f.read(1024 * 64)) # only read first 64K
print(input_enc) # {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
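If the detected encoding turns out to be something other than plain ASCII/UTF-8, you could pass it along when the dedup script opens the files. A minimal sketch combining the two ideas, assuming the 64K sample is representative of the whole file and using the semicolon delimiter from the question's data:

import chardet
import csv

# Detect the encoding from a small sample instead of the whole multi-GB file.
with open("input.csv", "rb") as f:
    input_enc = chardet.detect(f.read(1024 * 64))["encoding"] or "utf-8"

# Reuse the detected encoding for the actual read/write pass.
with open("input.csv", newline="", encoding=input_enc) as f_in, \
        open("output.csv", "w", newline="", encoding=input_enc) as f_out:
    reader = csv.reader(f_in, delimiter=";")
    writer = csv.writer(f_out, delimiter=";")
    writer.writerow(next(reader))  # transfer header

    ids = {}
    for row in reader:
        if row[0] not in ids:
            writer.writerow(row)
            ids[row[0]] = None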