Tags: json, scala, apache-spark, pyspark, jsonparser

Exploding a struct with no arrays in PySpark


I have JSON data like:

{
  "labels1": {"A": 1, "B": 2, "C": 3},
  "labels2": {"A": 1, "B": 2, "C": 3}
}

and I want three output columns: tagname, key, and value. The final output should look like:

tagname,key,value
labels1,A,1
labels1,B,2
labels1,C,3
labels2,A,1
labels2,B,2
labels2,C,3

How can I achieve this use case? Also, the keys A, B, C are just samples; there can be multiple optional fields. Thanks in advance, and let me know if any more information is required.


Solution

  • Try the built-in PySpark stack function for this case: stack the struct columns into rows, unnest each struct with a wildcard select, then stack the resulting key columns into (key, value) rows.

    Example:

    from pyspark.sql.functions import *

    json = """{"labels1":{"A":1,"B":2, "C":3},"labels2":{"A":1,"B":2, "C":3}}"""
    df = spark.read.json(sc.parallelize([json]), multiLine=True)

    # stack(2, ...) turns the two struct columns into rows of (col0, col1);
    # col1.* then unnests each struct, and a second stack melts the keys.
    df.select(expr("stack(2,'labels1',labels1,'labels2',labels2)")).\
      select(col("col0").alias("tagname"),col("col1.*")).\
      select("tagname",expr("stack(3,'A',A,'B',B,'C',C) as (key,value)")).show()
    
    #+-------+---+-----+
    #|tagname|key|value|
    #+-------+---+-----+
    #|labels1|  A|    1|
    #|labels1|  B|    2|
    #|labels1|  C|    3|
    #|labels2|  A|    1|
    #|labels2|  B|    2|
    #|labels2|  C|    3|
    #+-------+---+-----+
    
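Since the keys are optional and can vary, the second stack expression can be built dynamically instead of hard-coding A, B, C. Below is a minimal sketch of such an expression builder; in a real job the field list would come from the struct's schema (e.g. `df.schema["labels1"].dataType.fieldNames()`), while the names used here are just the sample keys:

```python
# Build "stack(n,'K1',K1,...) as (key,value)" from a list of field names,
# so new or optional keys need no code change. The field list is a sample;
# at runtime it would be read from the struct column's schema.
def stack_expr(fields):
    pairs = ",".join("'{0}',{0}".format(f) for f in fields)
    return "stack({},{}) as (key,value)".format(len(fields), pairs)

print(stack_expr(["A", "B", "C"]))
# stack(3,'A',A,'B',B,'C',C) as (key,value)
```

The generated string can then be dropped into the select above as `df.select("tagname", expr(stack_expr(keys)))`.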

    Another way, using the unpivot function (available since Spark 3.4):

    # add a dummy id column "n" so the struct columns can be melted first;
    # the second unpivot then melts the unnested keys into (key, value).
    df.withColumn("n",lit(1)).\
      unpivot("n",["labels1", "labels2"],"new","new1").select(col("new").alias("tagname"),col("new1.*")).\
      unpivot("tagname",["A","B","C"],"key","value").\
      show()