I have an issue that has already been discussed in several topics; nevertheless, I would like to go a bit deeper and maybe find a better solution.
The idea is to go through "huge" (50 to 60 GB) .csv files with Python, find the lines that satisfy some conditions, extract them, and finally store them in a second variable for further analysis.
Initially I handled this kind of problem for R scripts with a sparklyr connection, or possibly some gawk code in bash (see awk, or gawk), to extract the data I need and then analyse it with R/Python.
I would like to solve this exclusively with Python, to avoid mixing languages like bash/Python or bash/R (unix). So far I open the file and go through it line by line, and it kind of works, but it is awfully slow. Scanning the file itself is pretty fast (~500,000 lines per second, which is fine even for 58M lines), but as soon as I try to store the matching data, the speed drops to ~10 lines per second. For an extraction of ~300,000 lines, that is unacceptable.
I tried several solutions, and I'm guessing they are not optimal (poor Python code? :( ) and that better solutions probably exist.
Solution 1: go through the file, split each line into a list, check the conditions, and if they hold, convert the line to a numpy array and vstack it onto the accumulator at each matching iteration (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'
a = numpy.array(['colnames'] * 35)  # data is 35 columns
index = list()

with open("data.csv", "r") as f:
    for line in tqdm(f, unit=" lines per"):
        line = line.split(sep=";")  # csv separated with ";" ...
        date_file = line[1][0:10]  # date stored in the 2nd column
        if date_file >= date_first and date_file <= date_last:  # extraction covers a time period (one month for example)
            line = numpy.array(line)  # go to numpy
            a = numpy.vstack((a, line))  # stack it (copies the whole array every time)
Solution 2: the same, but store the line in a pandas DataFrame at a row index when the conditions hold (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'
row = 0  # row index
a = pandas.DataFrame(numpy.zeros((0, 35)))  # data is 35 columns

with open("data.csv", "r") as f:
    for line in tqdm(f, unit=" lines per"):
        line = line.split(sep=";")
        date_file = line[1][0:10]
        if date_file >= date_first and date_file <= date_last:
            a.loc[row] = line  # store the line in the DataFrame at position row
            row = row + 1  # go to the next row
Solution 3: the same, but instead of storing the line somewhere (which is the main issue for me), keep an index of the satisfying rows, then reopen the csv and keep only the rows I need (even slower; going through the file to find the indexes is fast enough, but reading back the indexed rows is awfully slow)
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'
index = list()

with open("data.csv", "r") as f:
    reader = csv.reader(f, delimiter=";")  # the reader already splits each line on ";"
    for row, line in enumerate(tqdm(reader, unit=" lines per")):
        date_file = line[1][0:10]
        if date_file >= date_first and date_file <= date_last:
            index.append(row)

with open("data.csv") as f:
    reader = csv.reader(f, delimiter=";")
    # membership test against a list, repeated for every one of the 57M rows
    interestingrows = [row for idx, row in enumerate(reader) if idx in index]
The idea is to keep only the data that satisfies the condition, here an extraction for a specific month. I do not understand where the problem comes from; saving the data somewhere (vstack, or writing into a pandas DataFrame) is definitely the issue. I'm pretty sure I'm doing something wrong, but I'm not sure where or what.
The data is a csv with 35 columns and over 57M rows. Thanks for reading.
O.
Appends to DataFrames and numpy arrays are very expensive because each append must copy all of the existing data to a new memory location, so doing it once per matching row makes the extraction quadratic in the number of matches. Instead, you can read the file in chunks, filter each chunk, and append the matches back out to a file. Here I've picked a chunk size of 100,000, but you can obviously change this.
I don't know the column names of your CSV, so I guessed 'date_file' for the date column. This should get you close:
import pandas as pd

date_first = '2008-11-01'
date_last = '2008-11-10'

chunks = pd.read_csv("data.csv", sep=";", chunksize=100000)  # your file is ";"-separated
for chunk in chunks:
    chunk = chunk[(chunk['date_file'].str[:10] >= date_first)
                  & (chunk['date_file'].str[:10] <= date_last)]
    # append each filtered chunk; header=False and index=False avoid repeating
    # the header and writing the row index for every chunk
    chunk.to_csv('output.csv', mode='a', header=False, index=False)
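If you would rather keep the matching rows in memory for further analysis than write them to a file, a much cheaper pattern than the vstack/.loc appends is to accumulate the matching lines in a plain Python list and build the DataFrame once at the end, since appending to a list does not copy the elements already collected. A minimal sketch, assuming the same ";"-separated layout with the date in the second column, as in your code:

import pandas as pd
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'

rows = []  # plain list: appends are cheap and never copy the data already collected
with open("data.csv", "r") as f:
    for line in tqdm(f, unit=" lines per"):
        fields = line.rstrip("\n").split(";")
        if date_first <= fields[1][0:10] <= date_last:
            rows.append(fields)

a = pd.DataFrame(rows)  # build the DataFrame once, after the scan

This keeps the fast line-by-line scan you already observed and pays the DataFrame construction cost only once, for the ~300,000 matching rows.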
Update 2024: Things have changed a lot since I answered this. The current approach would be to use polars, which can load data lazily. You would want to use scan_csv to lazily load the data that meets your criteria and then sink_csv for the output.
So, something like:
import polars as pl

date_first = '2008-11-01'
date_last = '2008-11-10'

(
    pl.scan_csv("data.csv", separator=";")  # lazy scan of the ";"-separated file; nothing is read yet
    .filter(
        pl.col("date_col").str.slice(0, length=10) >= date_first,
        pl.col("date_col").str.slice(0, length=10) <= date_last,
    )
    .sink_csv("output.csv")  # streams the filtered rows out in batches
)
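Since you ultimately want the extraction in a variable for further analysis, and ~300,000 rows fits comfortably in memory, you could also collect the filtered result into a DataFrame instead of sinking it to a file. A short sketch, reusing the same guessed date_col name:

subset = (
    pl.scan_csv("data.csv", separator=";")
    .filter(
        pl.col("date_col").str.slice(0, length=10) >= date_first,
        pl.col("date_col").str.slice(0, length=10) <= date_last,
    )
    .collect()  # materialise only the filtered rows as a polars DataFrame
)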
That will automatically batch the data loading for you and stream it back out to a new file. Note that the parquet file format is more compact and efficient for handling data these days, so it might be worth streaming it back out into that format, though it's not human-readable.
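If you go that route, the change is essentially one call: polars can stream straight to Parquet with sink_parquet, and scan_parquet reads it back lazily later. A sketch, still assuming the hypothetical date_col name:

(
    pl.scan_csv("data.csv", separator=";")
    .filter(
        pl.col("date_col").str.slice(0, length=10) >= date_first,
        pl.col("date_col").str.slice(0, length=10) <= date_last,
    )
    .sink_parquet("output.parquet")  # compact, typed, columnar output
)

extracted = pl.scan_parquet("output.parquet").collect()  # reload later for analysis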