I have ~100 local Parquet files that need to be loaded into BigQuery as a single table. All of the files are in one directory and have an identical schema. I have tried the following:
#in python
import pyarrow.dataset as ds
import pyarrow.parquet as pq
input_dir = '/dir/with/parquet/files'
dataset = ds.dataset(input_dir, format="parquet")
# to_table() materializes the entire dataset in memory before writing
pq.write_table(dataset.to_table(), "big_file.pqt")
The issue here is that the files total around 300 GB and the process runs out of memory. Maybe this could be done in chunks, but I don't know whether that is possible.
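If chunking is possible, I imagine it would look something like the sketch below, which streams record batches through pyarrow.parquet.ParquetWriter so that only one batch sits in memory at a time. This is just my rough idea and I have not been able to test it on the full 300 GB:
#in python (untested sketch of a chunked rewrite)
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

input_dir = '/dir/with/parquet/files'
dataset = ds.dataset(input_dir, format="parquet")

# write record batches one at a time instead of materializing everything with to_table()
with pq.ParquetWriter("big_file.pqt", dataset.schema) as writer:
    for batch in dataset.to_batches():
        writer.write_table(pa.Table.from_batches([batch]))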
#in command line
bq load --source_format=PARQUET my-project:mydataset.biqfile \
"/dir/with/parquet/files/*.pqt"
This throws exceptions. I have tried several different flavors of the above, such as (a) removing the quotes from the last line and (b) changing the last line to /dir/with/parquet/files/\*.pqt. All of them have thrown exceptions. I am a total beginner at scripting, so maybe I am missing something obvious here.
I wrote a script that loads each file sequentially into the desired table. Here is the code:
import os
import subprocess
project = 'my-project'
dataset = 'my_dataset'
table = 'my-table'
parquet_dir = '/dir/with/parquet/files'
#create an empty table
create_table_cmd = f'bq mk --table {project}:{dataset}.{table}'
print(f"Creating table: {create_table_cmd}")
subprocess.run(create_table_cmd, shell=True, check=True)
# Load data into the table
for file in os.listdir(parquet_dir):
    if file.endswith(('.parquet', '.pqt')):
        file_path = os.path.join(parquet_dir, file)
        load_cmd = f'bq load --source_format=PARQUET --autodetect {dataset}.{table} "{file_path}"'
        print(f"Loading file: {load_cmd}")
        subprocess.run(load_cmd, shell=True, check=True)
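This works, but shelling out to bq for every file feels clumsy. I have also been wondering whether the google-cloud-bigquery Python client could do the same sequential loads directly; the sketch below is only my rough idea of what that would look like (untested, and it assumes the client library is installed and the environment is already authenticated):
#in python (untested sketch using the BigQuery client library instead of subprocess)
import os
from google.cloud import bigquery

project = 'my-project'
dataset = 'my_dataset'
table = 'my-table'
parquet_dir = '/dir/with/parquet/files'

client = bigquery.Client(project=project)
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

for file in os.listdir(parquet_dir):
    if file.endswith(('.parquet', '.pqt')):
        file_path = os.path.join(parquet_dir, file)
        with open(file_path, "rb") as source_file:
            # each load job appends to the destination table by default
            job = client.load_table_from_file(
                source_file, f"{project}.{dataset}.{table}", job_config=job_config
            )
        job.result()  # wait for the load job to finish before starting the next one
        print(f"Loaded {file_path}")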