python · csv · pandas · blaze

Delete column(s) from very large CSV file using pandas or blaze


I have a very large csv file (5 GB) from which I want to delete one or more columns, and it is too big to load into memory all at once. I tried the following code in blaze, but all it did was append the resulting columns to the existing csv file:

from blaze import Data, odo
d = Data("myfile.csv")
d = d[columns_I_want_to_keep]
odo(d, "myfile.csv")

Is there a way using either pandas or blaze to only keep the columns I want and delete the other ones?


Solution

  • You can use dask.dataframe, which is syntactically similar to pandas but performs its manipulations out-of-core, so memory shouldn't be an issue. It also parallelizes the work automatically, so it should be fast.

    import dask.dataframe as dd
    
    # Read only the columns you want to keep, then write the result back out.
    df = dd.read_csv('myfile.csv', usecols=['col1', 'col2', 'col3'])
    df.to_csv('output.csv', index=False)
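
    Note: depending on your Dask version, to_csv may write one file per partition rather than a single CSV. Newer releases accept single_file=True to force a single output file; a minimal sketch, assuming that parameter is available in your version:

    import dask.dataframe as dd
    
    df = dd.read_csv('myfile.csv', usecols=['col1', 'col2', 'col3'])
    # single_file=True gathers all partitions into one output CSV;
    # omitting it (or passing an 'output-*.csv' pattern) may write one file per partition.
    df.to_csv('output.csv', index=False, single_file=True)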
    

    Timings

    I've timed each method posted so far on a 1.4 GB csv file. I kept four columns, leaving the output csv file at 250 MB.

    Using Dask:

    %%timeit
    df = dd.read_csv(f_in, usecols=cols_to_keep)
    df.to_csv(f_out, index=False)
    
    1 loop, best of 3: 41.8 s per loop
    

    Using Pandas:

    %%timeit
    chunksize = 10**5
    write_header = True
    for chunk in pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep):
        # Write the header only with the first chunk so it isn't repeated in the output.
        chunk.to_csv(f_out, mode='a', index=False, header=write_header)
        write_header = False
    
    1 loop, best of 3: 44.2 s per loop
    

    Using Python/CSV:

    %%timeit
    inc_f = open(f_in, 'r')
    csv_r = csv.reader(inc_f)
    out_f = open(f_out, 'w')
    csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n')
    for row in csv_r:
        # Keep only the four desired columns, selected here by position.
        new_row = [row[1], row[5], row[6], row[8]]
        csv_w.writerow(new_row)
    inc_f.close()
    out_f.close()
    
    1 loop, best of 3: 1min 1s per loop
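
    If you go the plain-csv route, a slightly more robust variant selects columns by name instead of hard-coded positions and uses context managers so the files are always closed. This is a sketch, assuming the file has a header row and that cols_to_keep holds hypothetical column names you would replace with your own:

    import csv
    
    cols_to_keep = ['col1', 'col2', 'col3', 'col4']  # hypothetical column names
    
    with open(f_in, 'r', newline='') as inc_f, open(f_out, 'w', newline='') as out_f:
        csv_r = csv.reader(inc_f)
        csv_w = csv.writer(out_f)
        header = next(csv_r)
        # Map the desired names to their positions in the header row.
        idx = [header.index(c) for c in cols_to_keep]
        csv_w.writerow([header[i] for i in idx])
        for row in csv_r:
            csv_w.writerow([row[i] for i in idx])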