Tags: apache-spark, pyspark, apache-spark-sql, parquet

Attach description of columns in Apache Spark using parquet format


I read a parquet file with:

df = spark.read.parquet(file_name)

And get the columns with:

df.columns

This returns a list of column names: ['col1', 'col2', 'col3']

I have read that the parquet format is able to store some metadata in the file.

Is there a way to store and read extra metadata, for example, to attach a human-readable description of what each column contains?

Thanks.


Solution

  • As of 2024 (Spark 3), Spark automatically writes column descriptions into parquet files and reads them back.

    Here is a minimal example in PySpark demonstrating it (the commented lines show the output printed by the program):

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    user_df = spark.sql("SELECT 'John' as first_name, 'Doe' as last_name")
    
    user_df = user_df.withMetadata("first_name", {"comment": "The user's first name"})
    user_df = user_df.withMetadata("last_name", {"comment": "The user's last name"})
    
    for field in user_df.schema.fields:
        print(field.name, field.metadata)
    
    # first_name {'comment': "The user's first name"}
    # last_name {'comment': "The user's last name"}
    
    # Write to parquet and read it back; the column metadata survives the round trip
    user_df.write.mode("overwrite").parquet("user")
    
    user_df_2 = spark.read.parquet("user")
    
    for field in user_df_2.schema.fields:
        print(field.name, field.metadata)
    
    # first_name {'comment': "The user's first name"}
    # last_name {'comment': "The user's last name"}
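
    If you are on a Spark version older than 3.3 (where DataFrame.withMetadata was introduced), a similar effect can be had with Column.alias, which accepts a metadata keyword argument. A minimal sketch, reusing the same user_df and comments as above:

    from pyspark.sql import functions as F
    
    # Attach the same comments via alias(..., metadata=...) instead of withMetadata
    user_df_3 = user_df.select(
        F.col("first_name").alias("first_name", metadata={"comment": "The user's first name"}),
        F.col("last_name").alias("last_name", metadata={"comment": "The user's last name"}),
    )
    
    for field in user_df_3.schema.fields:
        print(field.name, field.metadata)
    
    # first_name {'comment': "The user's first name"}
    # last_name {'comment': "The user's last name"}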