google-bigquery  pyarrow

Uploading multiple local parquet files to bq


I have ~100 local Parquet files that must be loaded into BigQuery as a single table. All files are in a single directory and have identical schemas. I have tried the following:

  1. Combine the files locally, then upload to BigQuery
#in python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

input_dir = '/dir/with/parquet/files'
dataset = ds.dataset(input_dir, format="parquet")
pq.write_table(dataset.to_table(), "big_file.pqt")

The issue here is that the files total around 300 GB, and we run out of memory. Maybe this could be done in chunks, but I don't know whether that is possible.
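
One chunked approach that should keep memory bounded is to stream the dataset in record batches with pyarrow's ParquetWriter instead of materializing everything with to_table(). This is only a sketch, reusing the input_dir above; the output file name is just a placeholder:

#in python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

input_dir = '/dir/with/parquet/files'
dataset = ds.dataset(input_dir, format="parquet")

# Write record batches one at a time so the full 300 GB never sits in memory
with pq.ParquetWriter("big_file.pqt", dataset.schema) as writer:
    for batch in dataset.to_batches():
        writer.write_table(pa.Table.from_batches([batch]))

Each batch is written out as soon as it is read, so peak memory use stays roughly at the size of one record batch rather than the whole dataset.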

  2. Load the whole directory as a single table using the command-line tool
#in command line
bq load --source_format=PARQUET my-project:mydataset.biqfile \
 "/dir/with/parquet/files/*.pqt"

This throws exceptions. I have tried several different flavors of the above, such as a) removing the quotes from the last line and b) changing the last line to /dir/with/parquet/files/\*.pqt. All have thrown exceptions. I am a total beginner at scripting, so maybe I am missing something obvious here.
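
Note that bq load does not accept wildcards for local files; wildcard URIs are only supported for sources already in Cloud Storage, which is likely why these variants fail. One alternative, sketched below with a placeholder bucket name, is to stage the files in a bucket first and then load them with a single wildcard URI:

#in command line
# copy the local files to a (hypothetical) bucket, then load with a gs:// wildcard
gsutil -m cp /dir/with/parquet/files/*.pqt gs://my-bucket/parquet/
bq load --source_format=PARQUET my-project:mydataset.biqfile \
 "gs://my-bucket/parquet/*.pqt"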


Solution

  • I wrote a script that loads each file sequentially into the desired table. Here is the code:

    import os
    import subprocess

    project = 'my-project'
    dataset = 'my_dataset'
    table = 'my-table'
    parquet_dir = '/dir/with/parquet/files'  # directory containing the parquet files

    # Create an empty destination table
    create_table_cmd = f'bq mk --table {project}:{dataset}.{table}'
    print(f"Creating table: {create_table_cmd}")
    subprocess.run(create_table_cmd, shell=True, check=True)

    # Load each parquet file into the table, one bq load job at a time
    for file in os.listdir(parquet_dir):
        if file.endswith(('.parquet', '.pqt')):
            file_path = os.path.join(parquet_dir, file)
            load_cmd = f'bq load --source_format=PARQUET --autodetect {project}:{dataset}.{table} "{file_path}"'
            print(f"Loading file: {load_cmd}")
            subprocess.run(load_cmd, shell=True, check=True)
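
For reference, the same sequential load can be done without shelling out by using the google-cloud-bigquery Python client. This is only a sketch, assuming the client library is installed, credentials are configured, and the same placeholder names as in the script above:

    #in python
    import os
    from google.cloud import bigquery

    parquet_dir = '/dir/with/parquet/files'
    client = bigquery.Client(project='my-project')
    table_id = 'my-project.my_dataset.my-table'
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

    # Upload each local parquet file as its own load job and wait for it to finish
    for file in os.listdir(parquet_dir):
        if file.endswith(('.parquet', '.pqt')):
            with open(os.path.join(parquet_dir, file), 'rb') as f:
                job = client.load_table_from_file(f, table_id, job_config=job_config)
            job.result()  # blocks until done; raises if the load job failed

The client creates the destination table on the first load if it does not already exist, so the separate bq mk step is not needed in this variant.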