
Copy the schema from one dataframe to another


I have a Spark dataframe (df1) with a particular schema, and another dataframe (df2) with the same columns but a different schema. I know how to change the types column by column, but with a large number of columns that would be quite lengthy. To keep the schema consistent across dataframes, I was wondering whether I could apply one dataframe's schema to another, or write a function that does the job.

Here is an example:

df1
# root
#  |-- A: date (nullable = true)
#  |-- B: integer (nullable = true)
#  |-- C: string (nullable = true)

df2
# root
#  |-- A: string (nullable = true)
#  |-- B: string (nullable = true)
#  |-- C: string (nullable = true)

I want to apply the schema of df1 to df2.

I tried the following approach for one column, but with a large number of columns it would be a lengthy way to do it.

df2 = df2.withColumn("B", df2["B"].cast('int'))

Solution

  • Yes, it's possible to do this dynamically with dataframe.schema.fields:

    df2.select(*[(col(x.name).cast(x.dataType)) for x in df1.schema.fields])

    Example:

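    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    # Reuse the active session if one exists (e.g. in a notebook or the pyspark shell)
    spark = SparkSession.builder.getOrCreate()

    # df1 carries the target schema: column A is parsed from a string into a date
    df1 = spark.createDataFrame([('2022-02-02', 2, 'a')], ['A', 'B', 'C']).withColumn("A", to_date(col("A")))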
    print("df1 Schema")
    df1.printSchema()
    #df1 Schema
    #root
    # |-- A: date (nullable = true)
    # |-- B: long (nullable = true)
    # |-- C: string (nullable = true)
    
    df2 = spark.createDataFrame([('2022-02-02','2','a')],['A','B','C'])
    print("df2 Schema")
    df2.printSchema()
    #df2 Schema
    #root
    # |-- A: string (nullable = true)
    # |-- B: string (nullable = true)
    # |-- C: string (nullable = true)
    #
    
    #casting the df2 columns by getting df1 schema using select clause
    df3 = df2.select(*[(col(x.name).cast(x.dataType)) for x in df1.schema.fields])
    df3.show(10,False)
    print("df3 Schema")
    df3.printSchema()
    
    #+----------+---+---+
    #|A         |B  |C  |
    #+----------+---+---+
    #|2022-02-02|2  |a  |
    #+----------+---+---+
    
    #df3 Schema
    #root
    # |-- A: date (nullable = true)
    # |-- B: long (nullable = true)
    # |-- C: string (nullable = true)
    

    In this example, df1 is defined with date, long, and string columns (Spark infers Python integers as long).

    df2 is defined with string columns only.

    df3 is built by using df2 as the source data and casting each column to the matching type in df1's schema.
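
Since the question also asks about a function that does the job, the one-liner above could be wrapped roughly as follows. This is only a sketch: `apply_schema` and its parameter names are hypothetical, not part of PySpark, and the check for missing columns is an added assumption about how you might want failures reported. It only casts the data types and does not attempt to copy the nullable flags.

```python
def apply_schema(df, schema_df):
    """Cast the columns of `df` to the types found in `schema_df`.

    Hypothetical helper (not a PySpark API). Only columns present in
    schema_df's schema are selected, in schema_df's column order.
    """
    target_fields = schema_df.schema.fields

    # Fail early with a clear message if df lacks a column the schema needs
    # (exact, case-sensitive name match is an assumption of this sketch).
    missing = [f.name for f in target_fields if f.name not in df.columns]
    if missing:
        raise ValueError(f"Columns missing from df: {missing}")

    # Same technique as above: cast each column to the target data type,
    # keeping the original column name explicitly via alias().
    return df.select(
        *[df[f.name].cast(f.dataType).alias(f.name) for f in target_fields]
    )


# Usage with the dataframes from the example above:
# df3 = apply_schema(df2, df1)
# df3.printSchema()
```

Depending on the Spark version and the ANSI setting, values that cannot be converted to the target type may end up as null or raise an error, so it is worth checking the result on real data.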