I want to parse a JSON request and create multiple columns out of it in pyspark as follows:
{
"ID": "abc123",
"device": "mobile",
"Ads": [
{
"placement": "topright",
"Adlist": [
{
"name": "ad1",
"subtype": "placeholder1",
"category": "socialmedia",
},
{
"name": "ad2",
"subtype": "placeholder2",
"category": "media",
},
{
"name": "ad3",
"subtype": "placeholder3",
"category": "printingpress",
}
]
},
{
"Placement": "bottomleft",
"Adlist": [
{
"name": "ad4",
"subtype": "placeholder4",
"category": "socialmedia",
},
{
"name": "ad5",
"subtype": "placeholder5",
"category": "media",
},
{
"name": "ad6",
"subtype": "placeholder6",
"category": "printingpress",
}
]
}
]
}
I tried the following:
from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
rdd = sc.parallelize([sample_json])
df = spark.read.option('multiline', "true").json(rdd)
df.printSchema()
root
|-- Ads: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Adlist: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- category: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- subtype: string (nullable = true)
| | |-- placement: string (nullable = true)
|-- ID: string (nullable = true)
|-- device: string (nullable = true)
The resulting dataframe looks like this:
+--------------------+------+------+
| Ads| ID|device|
+--------------------+------+------+
|[{[{socialmedia, ...|abc123|mobile|
+--------------------+------+------+
However, the output I am looking for is as follows:
Id | device | placement | name | subtype | category |
-------------------------------------------------------------------
abc123| mobile | topright | ad1 | placeholder1 | socialmedia |
abc123| mobile | topright | ad2 | placeholder2 | media |
abc123| mobile | topright | ad3 | placeholder3 | printingpress |
abc123| mobile |bottomleft | ad4 | placeholder4 | socialmedia |
abc123| mobile |bottomleft | ad5 | placeholder5 | media |
abc123| mobile |bottomleft | ad6 | placeholder6 | printingpress |
A combination of explode and ".*" will do the trick:
df = df.select(col("Id"), col("device"), explode(col("Ads")).alias("Ads")) \
.select("Id", "device", "Ads.*") \
.select(col("Id"), col("device"), explode(col("Adlist")).alias("Adlist"), col("placement")) \
.select("Id", "device", "placement", "Adlist.*")
df.show()
+------+------+---------+-------------+----+------------+
| Id|device|placement| category|name| subtype|
+------+------+---------+-------------+----+------------+
|abc123|mobile| topright| socialmedia| ad1|placeholder1|
|abc123|mobile| topright| media| ad2|placeholder2|
|abc123|mobile| topright|printingpress| ad3|placeholder3|
|abc123|mobile| null| socialmedia| ad4|placeholder4|
|abc123|mobile| null| media| ad5|placeholder5|
|abc123|mobile| null|printingpress| ad6|placeholder6|
+------+------+---------+-------------+----+------------+
FYI: