python, pandas, parquet, fastparquet

how to efficiently read pq files - Python


I have a list of files with the .pq extension; their names are stored in a Python list. My intention is to read these files, filter them with pandas, and then merge them into a single pandas DataFrame.

Since there are thousands of files, the code currently runs very inefficiently. The biggest bottleneck is reading the .pq files; to isolate it, I commented out the filtering part during the experiments. I've tried three different ways, as shown below, but it takes about 1.5 seconds to read each file, which is quite slow. Are there alternative ways to perform these operations?

from tqdm import tqdm
from fastparquet import ParquetFile
import pandas as pd 
import pyarrow.parquet as pq

files = [.....]

#First way
for file in tqdm(files):
    temp = pd.read_parquet(file)
    # filter temp and append

#Second way
for file in tqdm(files):
    temp = ParquetFile(file).to_pandas()
    # filter temp and append

#Third way
for file in tqdm(files):
    temp = pq.read_table(source=file).to_pandas()
    # filter temp and append

Each read inside the for loop takes quite a long time; for 24 files, I spend 28 seconds:

 24/24 [00:28<00:00,  1.19s/it]

 24/24 [00:25<00:00,  1.08s/it]

One sample file is 90 MB on average, which corresponds to 667,858 rows and 48 columns. All data is numerical (i.e. float64). The number of rows may vary, but the number of columns remains the same.


Solution

  • Read multiple parquet files (partitions) at once into a pyarrow.parquet.ParquetDataset, which accepts a directory name, a single file name, or a list of file names, and conveniently allows filtering of the scanned data:

    import pyarrow.parquet as pq

    # your_files is the list of .pq paths; the filter keeps only rows
    # whose columnName value appears in filterList, applied during the scan.
    dataset = pq.ParquetDataset(your_files,
                                use_legacy_dataset=False,
                                filters=[('columnName', 'in', filterList)])
    df = dataset.read(use_threads=True).to_pandas()
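
  • Note that recent pyarrow releases deprecate the use_legacy_dataset flag; the same scan-time filtering is also available through the pyarrow.dataset API. A minimal sketch of the equivalent call, reusing the placeholder names your_files, columnName, and filterList from above:

    import pyarrow.dataset as ds

    # Build one logical dataset over all the .pq files.
    dataset = ds.dataset(your_files, format="parquet")

    # The filter is pushed down into the scan, so non-matching rows
    # are skipped before being materialized in memory; the scan is
    # multithreaded by default.
    table = dataset.to_table(filter=ds.field("columnName").isin(filterList))
    df = table.to_pandas()

    Either way, reading all files in one scan avoids the per-file Python overhead of the loop in the question.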