I am using Python 3 and pandas (pd.read_csv) to read the files. There are no headers and the separator is ' |, | '. Also, the files are not .csv files and the operating system is CentOS.
There are 30,000 files in a folder with a total size of 10GB. Each file has about 50-100 rows and 1500 columns. I read each file (using read_csv), do some operations on it, and store the result in a list via a for loop. At the end of the process I have a list of dataframes. I was wondering how to speed up the process. Only 10 of the columns are relevant, so I use the usecols argument to filter them. The cell values are strings, so I convert them to float using df.astype(float).
Note that I have to do my operations on each of the files separately and only then append them all together.
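For reference, each file is currently read roughly like the sketch below. The path and the column indices are placeholders, and the separator is escaped on the assumption that it should be matched literally (the '|' character would otherwise be treated as a regex metacharacter).
import re
import pandas as pd
# Rough sketch of the current per-file read; the path and the usecols indices
# are placeholders. A multi-character separator is treated as a regular
# expression and requires the Python parsing engine.
df = pd.read_csv(
    "some_file.txt",                      # placeholder path
    sep=re.escape(" |, | "),              # the literal ' |, | ' separator (assumed literal)
    engine="python",                      # needed for a multi-character separator
    header=None,                          # the files have no header row
    usecols=[0, 3, 7, 12, 40, 100, 250, 500, 900, 1400],  # 10 placeholder column indices
)
df = df.astype(float)                     # cell values arrive as strings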
I tried Modin, but it led to a several-fold slowdown. Also, with Modin the index in each dataframe gets repeated multiple times, which didn't happen with plain pandas.
One way of doing this is with Dask delayed. The problem with plain Python and pandas is that everything runs sequentially, which can really slow down your application, especially with a mix of IO-intensive and CPU-intensive tasks. With Dask you can parallelize the reading and processing of your data; one way I would go about it is the following.
from dask.delayed import delayed
import dask.dataframe as dd
import pandas as pd

file_names = []  # fill with your file paths, e.g. via glob.glob or os.listdir

@delayed
def read_data(file_name):
    # pass the sep, header=None and usecols arguments from the question here
    return pd.read_csv(file_name)

@delayed
def process(df):
    # do your per-file operations here, e.g. df = df.astype(float)
    return df

data = [process(read_data(file_name)) for file_name in file_names]
data = dd.compute(data)  # a tuple containing the list of computed dataframes
print(data)
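Since the computed results are plain pandas dataframes, the final step the question mentions, appending everything together, is a single concat. Here is a minimal sketch continuing from the snippet above; the scheduler choice is a suggestion of mine (the multiprocessing scheduler can help when the per-file work is CPU-bound, while the default threaded scheduler is usually enough for IO-bound reads).
import dask
import pandas as pd

# Alternative to the dd.compute(...) call above: run the delayed tasks on the
# multiprocessing scheduler. `tasks` is the same list of delayed objects
# built from read_data and process in the snippet above.
tasks = [process(read_data(file_name)) for file_name in file_names]
(dfs,) = dask.compute(tasks, scheduler="processes")

# Final step from the question: append the per-file dataframes together.
combined = pd.concat(dfs, ignore_index=True)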