pyspark

Changing a JSON column into MapType in PySpark


I am trying to convert a JSON column in a PySpark DataFrame to a MapType using from_json. While using MapType directly works, using StructType results in null values. Here is my code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, MapType, StructType, StructField
from pyspark.sql.functions import from_json

spark = SparkSession.builder.appName("example").getOrCreate()

data = [(1, '''{"a": "1"}''')]
df = spark.createDataFrame(data, ["key", "value"])
df.show()

# This line changes the data of the value column to MapType
df1 = df.withColumn('data', from_json(df.value, MapType(StringType(), StringType())))
df1.show()

# This line does not change the data of the value column to MapType
schema = StructType([StructField("a", MapType(StringType(), StringType()))])
df2 = df.withColumn('data', from_json(df.value, schema))
df2.show()

# Output of df1
# +---+----------+--------+
# |key|     value|    data|
# +---+----------+--------+
# |  1|{"a": "1"}|{a -> 1}|
# +---+----------+--------+

# Output of df2
# +---+----------+------+
# |key|     value|  data|
# +---+----------+------+
# |  1|{"a": "1"}|  null|
# +---+----------+------+

Why does from_json return null when using StructType but works fine with MapType? How can I correctly use StructType to parse the JSON column?


Solution

  • Your first change is:

    df1 = df.withColumn("data", F.from_json(df.value, T.MapType(T.StringType(), T.StringType())))
    

    Your second change is:

    df2 = df.withColumn("data", F.from_json(df.value, schema))
    

    While these lines look similar, the schemas are different:

    The first schema is a simple map, while the second schema is a struct containing a map. This difference is why the second approach returns null: the input JSON does not match the struct schema. Your second schema would match a JSON document like this one:

    {"a": {"b": "c"}}
    
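    As a quick sanity check (a sketch; the `nested` and `df3` variable names are mine, not from the question), the struct-containing-a-map schema does parse such nested JSON:

        from pyspark.sql import SparkSession, functions as F, types as T

        spark = SparkSession.builder.appName("example").getOrCreate()

        # Nested JSON: the value of "a" is itself an object, i.e. a map
        nested = spark.createDataFrame([(1, '{"a": {"b": "c"}}')], ["key", "value"])
        schema = T.StructType([T.StructField("a", T.MapType(T.StringType(), T.StringType()))])

        # Here from_json succeeds: "data" is a struct whose field "a" is the map {b -> c}
        df3 = nested.withColumn("data", F.from_json("value", schema))
        df3.show(truncate=False)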

    To achieve the desired result, you should use:

    schema = T.StructType([T.StructField("a", T.StringType())])
    

    This will give you the output:

    +---+----------+----+
    |key|     value|data|
    +---+----------+----+
    |  1|{"a": "1"}| {1}|
    +---+----------+----+
    
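    Putting it together (a sketch; the `df2` variable name is mine), the parsed field can then be pulled out of the struct with dotted column access:

        from pyspark.sql import SparkSession, functions as F, types as T

        spark = SparkSession.builder.appName("example").getOrCreate()

        df = spark.createDataFrame([(1, '{"a": "1"}')], ["key", "value"])

        # Schema matching the actual JSON: "a" maps to a plain string
        schema = T.StructType([T.StructField("a", T.StringType())])
        df2 = df.withColumn("data", F.from_json("value", schema))

        # Struct fields are addressable by name, e.g. data.a
        df2.select("key", "data.a").show()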

    Note

    I used the following imports:

    from pyspark.sql import functions as F, types as T