python apache-spark pyspark schema pyspark-schema

Copy the schema from one dataframe to another

I have a Spark data frame (df1) with a particular schema, and I have another dataframe with the same columns, but different schema. I know how to do it column by column, but since I have a large set of columns, it would be quite lengthy. To keep the schema consistent across dataframes, I was wondering if I could be able to apply one schema to another data frame or creating a function that do the job.

Here is an example:

df1
# root
#  |-- A: date (nullable = true)
#  |-- B: integer (nullable = true)
#  |-- C: string (nullable = true)

df2
# root
#  |-- A: string (nullable = true)
#  |-- B: string (nullable = true)
#  |-- C: string (nullable = true)`

I want to copy apply the schema of df1 to df2.

I tried this approach for one column. Given that I have a large number of columns, it would be quite a lengthy way to do it.

df2 = df2.withColumn("B", df2["B"].cast('int'))

Solution

Yes, its possible dynamically with dataframe.schema.fields

df2.select(*[(col(x.name).cast(x.dataType)) for x in df1.schema.fields])

Example:

from pyspark.sql.functions import *
df1 = spark.createDataFrame([('2022-02-02',2,'a')],['A','B','C']).withColumn("A",to_date(col("A")))
print("df1 Schema")
df1.printSchema()
#df1 Schema
#root
# |-- A: date (nullable = true)
# |-- B: long (nullable = true)
# |-- C: string (nullable = true)

df2 = spark.createDataFrame([('2022-02-02','2','a')],['A','B','C'])
print("df2 Schema")
df2.printSchema()
#df2 Schema
#root
# |-- A: string (nullable = true)
# |-- B: string (nullable = true)
# |-- C: string (nullable = true)
#

#casting the df2 columns by getting df1 schema using select clause
df3 = df2.select(*[(col(x.name).cast(x.dataType)) for x in df1.schema.fields])
df3.show(10,False)
print("df3 Schema")
df3.printSchema()

#+----------+---+---+
#|A         |B  |C  |
#+----------+---+---+
#|2022-02-02|2  |a  |
#+----------+---+---+

#df3 Schema
#root
# |-- A: date (nullable = true)
# |-- B: long (nullable = true)
# |-- C: string (nullable = true)

In this example I have df1 defined with Integer,date,long types.

df2 is defined with string type.

df3 is defined by using df2 as source data and attached df1 schema.