I am trying to convert a JSON column in a PySpark DataFrame to a MapType using from_json. While using MapType directly works, using StructType results in null values. Here is my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, MapType, StructType, StructField
from pyspark.sql.functions import from_json
spark = SparkSession.builder.appName("example").getOrCreate()
data = [(1, '''{"a": "1"}''')]
df = spark.createDataFrame(data, ["key", "value"])
df.show()
# This parses the JSON in the value column into a new MapType column named data
df1 = df.withColumn('data', from_json(df.value, MapType(StringType(), StringType())))
df1.show()
# This attempts the same with a StructType schema but yields null
schema = StructType([StructField("a", MapType(StringType(), StringType()))])
df2 = df.withColumn('data', from_json(df.value, schema))
df2.show()
# Output of df1
# +---+----------+--------+
# |key| value| data|
# +---+----------+--------+
# | 1|{"a": "1"}|{a -> 1}|
# +---+----------+--------+
# Output of df2
# +---+----------+------+
# |key| value| data|
# +---+----------+------+
# | 1|{"a": "1"}| null|
# +---+----------+------+
Why does from_json return null when using StructType but work fine with MapType? How can I correctly use StructType to parse the JSON column?
Your first transformation is:
df1 = df.withColumn("data", F.from_json(df.value, T.MapType(T.StringType(), T.StringType())))
Your second transformation is:
df2 = df.withColumn("data", F.from_json(df.value, schema))
While these lines look similar, the schemas are different:
T.MapType(T.StringType(), T.StringType())
T.StructType([T.StructField("a", T.MapType(T.StringType(), T.StringType()))])
The first schema is a simple map, while the second is a struct containing a map. That mismatch is why the second approach returns null: from_json cannot fit the string value "1" into a map field. Your second schema would match a JSON document like this one:
{"a": {"b": "c"}}
To achieve the desired result, you should use:
schema = T.StructType([T.StructField("a", T.StringType())])
This will give you the output:
+---+----------+----+
|key| value|data|
+---+----------+----+
|  1|{"a": "1"}| {1}|
+---+----------+----+
I used the following imports:
from pyspark.sql import functions as F, types as T