pandas, parquet, fastparquet

Convert multiple CSVs to a single partitioned parquet dataset


I have a set of CSV files, one per year of data, with a YEAR column in each. I want to convert them into a single parquet dataset, partitioned by year, for later use in pandas. The problem is that the dataframe with all years combined is too large to fit in memory. Is it possible to write the parquet partitions iteratively, one by one?

I am using fastparquet as the engine.

Here is a simplified code example; it blows up memory usage and crashes.

import pandas as pd

# Read all years into memory and concatenate; this is where memory blows up,
# because every year is held in RAM at once before writing.
df = []
for year in range(2000, 2020):
  df.append(pd.read_csv(f'{year}.csv'))
df = pd.concat(df)
df.to_parquet('all_years.pq', partition_cols=['YEAR'])

I tried writing the years one by one, like so.

# Write each year separately into the same dataset directory.
for year in range(2000, 2020):
  df = pd.read_csv(f'{year}.csv')
  df.to_parquet('all_years.pq', partition_cols=['YEAR'])

The data files are all there in their respective YEAR=XXXX directories, but when I try to read the resulting dataset, I only get the last year. I suspect each to_parquet call overwrites the top-level _metadata file, so it ends up referencing only the partition written last. Maybe it is possible to fix the parquet metadata after writing the separate partitions?
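
A quick way to see the problem, assuming the same paths as in the snippets above, is to read the dataset back and check which years actually come out:

# Only the last written year shows up when reading the dataset back.
check = pd.read_parquet('all_years.pq', engine='fastparquet')
print(sorted(check['YEAR'].unique()))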


Solution

  • I think I found a way to do it using the fastparquet.writer.merge() function. Parquet files are written one by one for each year, leaving out the YEAR column and giving the files appropriate names, and then merge() creates the top-level _metadata file.

    The code below is only a sketch; I leave out many details of my concrete use case.

    import os
    import pandas as pd
    import fastparquet

    years = range(2000, 2020)
    os.makedirs('all_years.pq', exist_ok=True)  # parent directory for the partitions
    for year in years:
      # One file per year; YEAR is dropped because it is encoded in the file name.
      df = pd.read_csv(f'{year}.csv').drop(columns=['YEAR'])
      df.to_parquet(f'all_years.pq/YEAR={year}')
    # Build the top-level _metadata file that ties the partitions together.
    fastparquet.writer.merge([f'all_years.pq/YEAR={y}' for y in years])

    df_all = pd.read_parquet('all_years.pq')
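
    As a sanity check (reusing the names from the snippet above), reading the merged dataset back should now return every year, with YEAR restored as a partition column:

    # All 20 years should be present after the merge.
    assert sorted(int(y) for y in df_all['YEAR'].unique()) == list(range(2000, 2020))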