I would like to delete duplicates from one large CSV file. The data has this format:
client_id;gender;age;profese;addr_cntry;NAZOKRESU;prijem_AVG_6M_pasmo;cont_id;main_prod_id;bal_actl_am_pasmo
388713248;F;80;důchodce;CZ;Czech;;5715125;39775;
27953927;M;28;Dělník;CZ;Opavia;22;4427292;39075;
I need to delete all rows with duplicate client_id values. I cannot handle this big file in Python with pandas. I tried Dask, but got the same result: an infinite wait and nothing really happening.
Here is my latest version of the code:
import dask.dataframe as dd
import chardet
from dask.diagnostics import ProgressBar
with open('bigData.csv', 'rb') as f:
    result = chardet.detect(f.read())
df = dd.read_csv('bigData.csv', encoding=result['encoding'], sep=';')
total_rows = df.shape[0].compute()
df = df.drop_duplicates(subset=['client_id'], keep=False, Inplace=True)
df.to_csv('bigData.csv', sep=';', index=False)
total_duplicates = total_rows - df.shape[0].compute()
print(f'Deleted {total_duplicates} duplicate rows.')
I tried it with a progress bar and nothing really happened. Thanks for the help!
You might be able to get away with a very simple Python program that stores every new ID it sees in a dict and skips writing a subsequent row if that row's ID is already in the dict. It should require about 2 GB of RAM.
import csv

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    # For the semicolon-separated file in the question, pass delimiter=";" to both
    # the reader and the writer.
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))  # transfer header, if you have one

    ids = {}
    for row in reader:
        if row[0] not in ids:
            writer.writerow(row)
            ids[row[0]] = None  # add ID to "list" of already written IDs
This approach uses a single dict, `ids`, to hold every ID the program has already encountered and written; dicts do very fast lookups/membership checks on their keys (your IDs).

I mocked up a CSV with 20M rows (with randomly generated IDs between 0 and 20M) that looks something like this (a sketch for generating a similar mock file follows the table):
| id | i |
|----------|---|
| 2266768 | 0 |
| 15245359 | 1 |
| 16304974 | 2 |
| 4801643 | 3 |
| 9612409 | 4 |
| 17659151 | 5 |
| 15824934 | 6 |
| 4101873 | 7 |
| 12282127 | 8 |
| 5172219 | 9 |
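This is not the exact script used for the mock, but a minimal sketch of how a file of that shape could be generated; the 20M row count and the 0–20M ID range come from the description above, and the column names are simply the ones shown in the sample:

import csv
import random

# Hypothetical mock-data generator: 20M rows with IDs drawn at random
# from 0..20M, plus a running counter in a second column.
with open("input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "i"])
    for i in range(20_000_000):
        writer.writerow([random.randint(0, 20_000_000), i])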
I ran it through that program and ended up with 12.6M rows. On my MacBook Air (M1, dual-channel SSD) that took 14 seconds and consumed 1.5 GB of RAM; the RAM is needed to hold all the previously seen IDs.
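If that memory footprint is a concern, one variation (not part of the timing above) is to store the IDs in a set as ints, which is usually leaner than a dict of strings when the IDs are purely numeric:

import csv

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # transfer header

    seen = set()
    for row in reader:
        key = int(row[0])  # assumes the ID column is always numeric, as in the question's data
        if key not in seen:
            writer.writerow(row)
            seen.add(key)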
Also, I see you are reading the entire file first just to detect the character encoding.
Have you tried running `chardetect` from the command line (`chardetect input.csv`) and just hard-coding the value it returns?

Have you experimented with reading a much smaller portion of the file to see what encoding and confidence you get?
with open("input.csv", "rb") as f:
input_enc = chardet.detect(f.read(1024 * 64)) # only read first 64K
print(input_enc) # {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
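If the detected encoding turns out to be something other than plain ASCII/UTF-8, you could pass it along when the dedup script opens the files. A minimal sketch combining the two ideas, assuming the 64K sample is representative of the whole file and using the semicolon delimiter from the question's data:

import chardet
import csv

# Detect the encoding from a small sample instead of the whole multi-GB file.
with open("input.csv", "rb") as f:
    input_enc = chardet.detect(f.read(1024 * 64))["encoding"] or "utf-8"

# Reuse the detected encoding for the actual read/write pass.
with open("input.csv", newline="", encoding=input_enc) as f_in, \
        open("output.csv", "w", newline="", encoding=input_enc) as f_out:
    reader = csv.reader(f_in, delimiter=";")
    writer = csv.writer(f_out, delimiter=";")
    writer.writerow(next(reader))  # transfer header

    ids = {}
    for row in reader:
        if row[0] not in ids:
            writer.writerow(row)
            ids[row[0]] = None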