bash, postgresql, hdfs, parquet

Is there a way to directly insert data from a parquet file into a PostgreSQL database?


I'm trying to restore some historic backup files that were saved in parquet format, and I want to read them once and write the data into a PostgreSQL database.

I know the backup files were saved using Spark, but there is a strict restriction: I can't install Spark on the DB machine, nor read the parquet files with Spark on a remote machine and write them to the database using spark_df.write.jdbc. Everything needs to happen on the DB machine, without Spark or Hadoop, using only Postgres and Bash scripting.

My file structure is something like:

foo/
    foo/part-00000-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00001-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    foo/part-00002-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
    ..
    ..

I expect to read the data and the schema from each parquet folder like foo, create a table with that schema, and write the data into that table, using only Bash and the Postgres CLI.
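
Roughly, what I'm hoping for is something like the following (parquet_to_csv is a made-up placeholder for the step I'm missing, and the table/column names are just an example; the real schema would have to come from the parquet files themselves):

    # the step I don't know how to do without Spark/Hadoop --
    # "parquet_to_csv" is a made-up placeholder, not a real tool
    parquet_to_csv foo/part-*.parquet > /tmp/foo.csv

    # the rest I can do with plain psql
    psql -d mydb -c "CREATE TABLE foo (id bigint, name text);"
    psql -d mydb -c "\copy foo FROM '/tmp/foo.csv' WITH (FORMAT csv)"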


Solution

  • You can use Spark to convert the parquet files to CSV format, then move the CSV files to the DB machine and import them with any tool you like (a psql-only import is sketched after the example below).

    # On a machine where Spark is available, convert the parquet files to CSV
    spark.read.parquet("...").write.csv("...")
    
    # Then, on the DB machine, load the CSV with pandas and push it into Postgres
    import pandas as pd
    from sqlalchemy import create_engine
    
    df = pd.read_csv('mypath.csv')
    # lower-case the column names: Postgres folds unquoted identifiers to lower
    # case, and capitals or spaces would force you to quote them everywhere
    df.columns = [c.lower() for c in df.columns]
    
    engine = create_engine('postgresql://username:password@localhost:5432/dbname')
    # to_sql creates the table if it does not exist yet and inserts the rows
    df.to_sql("my_table_name", engine)
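
    If pandas/SQLAlchemy aren't available on the DB machine either, psql by itself can load the CSVs once they are there. A rough sketch (database name, table definition, and paths below are placeholders to adapt to your actual schema):

    # create the target table by hand, matching the columns in your data
    psql -d mydb -c "CREATE TABLE my_table_name (col_a bigint, col_b text);"
    
    # Spark writes CSV output as a folder of part files, so load each one;
    # add HEADER true to the options if the CSV was written with a header row
    for f in /path/to/csv_output/part-*.csv; do
        psql -d mydb -c "\copy my_table_name FROM '$f' WITH (FORMAT csv)"
    done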