I have a dataframe which has 2 columns" "ID" and "input_array" (values are JSON arrays).
ID input_array
1 [ {“A”:300, “B”:400}, { “A”:500,”B”: 600} ]
2 [ {“A”: 800, “B”: 900} ]
Output that I need:
ID A B
1 300 400
1 500 600
2 800 900
I tried from_json
, explode
functions. But data type mismatch error is coming for array columns.
I have 2 interpretations of what input (column "input_array") data types you have.
If the type of input column is a string...
df = spark.createDataFrame(
[(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
(2, '[ {"A": 800, "B": 900} ]')],
['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: string (nullable = true)
...you can use from_json
to extract Spark structure from JSON string and then inline
to explode the resulting array of structs into columns.
df = df.selectExpr(
"ID",
"inline(from_json(input_array, 'array<struct<A:long,B:long>>'))"
)
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
If the type of input column is an array of strings...
df = spark.createDataFrame(
[(1, [ '{"A":300, "B":400}', '{ "A":500,"B": 600}' ]),
(2, [ '{"A": 800, "B": 900}' ])],
['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: array (nullable = true)
# | |-- element: string (containsNull = true)
...you can first use explode
to move every array's element into rows thus resulting in a column of string type, then use from_json
to create Spark data types from the strings and finally expand *
the structs into columns.
from pyspark.sql import functions as F
df = df.withColumn('input_array', F.explode('input_array'))
df = df.withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
df = df.select('ID', 'input_array.*')
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+