I have a Spark DataFrame (df1) with a particular schema, and I have another DataFrame with the same column names but different data types. I know how to cast the columns one by one, but since I have a large number of columns, that would be quite lengthy. To keep the schema consistent across DataFrames, I was wondering whether I could apply the schema of one DataFrame to another, or write a function that does the job.
Here is an example:
df1
# root
# |-- A: date (nullable = true)
# |-- B: integer (nullable = true)
# |-- C: string (nullable = true)
df2
# root
# |-- A: string (nullable = true)
# |-- B: string (nullable = true)
# |-- C: string (nullable = true)
I want to apply the schema of df1 to df2.
I tried the approach below for one column, but with a large number of columns it would be a lengthy way to do it:
df2 = df2.withColumn("B", df2["B"].cast('int'))
Yes, it's possible to do this dynamically using dataframe.schema.fields:
df2.select(*[(col(x.name).cast(x.dataType)) for x in df1.schema.fields])
Example:
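# import only what the example uses; a wildcard import shadows Python built-ins such as sum and max
from pyspark.sql.functions import col, to_date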
df1 = spark.createDataFrame([('2022-02-02',2,'a')],['A','B','C']).withColumn("A",to_date(col("A")))
print("df1 Schema")
df1.printSchema()
#df1 Schema
#root
# |-- A: date (nullable = true)
# |-- B: long (nullable = true)
# |-- C: string (nullable = true)
df2 = spark.createDataFrame([('2022-02-02','2','a')],['A','B','C'])
print("df2 Schema")
df2.printSchema()
#df2 Schema
#root
# |-- A: string (nullable = true)
# |-- B: string (nullable = true)
# |-- C: string (nullable = true)
#
#casting the df2 columns by getting df1 schema using select clause
df3 = df2.select(*[(col(x.name).cast(x.dataType)) for x in df1.schema.fields])
df3.show(10,False)
print("df3 Schema")
df3.printSchema()
#+----------+---+---+
#|A |B |C |
#+----------+---+---+
#|2022-02-02|2 |a |
#+----------+---+---+
#df3 Schema
#root
# |-- A: date (nullable = true)
# |-- B: long (nullable = true)
# |-- C: string (nullable = true)
In this example, df1 is defined with date, long, and string types. df2 is defined with string type for every column. df3 is created by using df2 as the source data and casting its columns to the types in df1's schema.