I have a problem I'm trying to resolve.
Given a JSON file like:
{
  "database": {
    "table_1": {
      "load_type": "Delta",
      "columns": {
        "col_t1_1": "NVARCHAR2",
        "col_t1_2": "NUMBER",
        "col_t1_3": "NVARCHAR2",
        "col_t1_4": "NVARCHAR2",
        "col_t1_5": "NVARCHAR2"
      }
    },
    "table_2": {
      "load_type": "Delta",
      "columns": {
        "col_t2_1": "NVARCHAR2",
        "col_t2_2": "NVARCHAR2",
        "col_t2_3": "NVARCHAR2",
        "col_t2_4": "NVARCHAR2"
      }
    }
  }
}
I tried using explode to get the result I need, but it's difficult and a little complex, which is why I'm asking for help resolving this issue.
Best regards
Here's how you can do it in PySpark (you can likely translate it to Scala):
Contents of json-on-one-line (the whole JSON collapsed onto a single line, since spark.read.json expects one JSON document per line by default):
{ "database": { "table_1": { "load_type": "Delta", "columns": { "col_t1_1": "NVARCHAR2", "col_t1_2": "NUMBER", "col_t1_3": "NVARCHAR2", "col_t1_4": "NVARCHAR2", "col_t1_5": "NVARCHAR2" } }, "table_2": { "load_type": "Delta", "columns": { "col_t2_1": "NVARCHAR2", "col_t2_2": "NVARCHAR2", "col_t2_3": "NVARCHAR2", "col_t2_4": "NVARCHAR2" } } } }
Solution:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the JSON file. spark.read.json expects one JSON document per line by
# default; alternatively, pass multiLine=True to read the pretty-printed file.
jsonDF = spark.read.json("./json-on-one-line")

# The field names under "database" are the table names
database_columns = jsonDF.selectExpr("database.*").columns

# For each table, collect its name, load type, and column names
allColumns = []
for elem in database_columns:
    nested_columns = jsonDF.selectExpr("database." + elem + ".columns.*").columns
    load_type = jsonDF.selectExpr("database." + elem + ".load_type").first()[0]
    allColumns.append([elem, load_type, ", ".join(nested_columns)])

df_columns = spark.sparkContext.parallelize(allColumns).toDF(
    ["table", "load_type", "ColumnAsString"]
)
df_columns.show()
+-------+---------+--------------------+
| table|load_type| ColumnAsString|
+-------+---------+--------------------+
|table_1| Delta|col_t1_1, col_t1_...|
|table_2| Delta|col_t2_1, col_t2_...|
+-------+---------+--------------------+
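Since the question mentions explode: if you supply an explicit schema that models "database" as a map from table name to a metadata struct, explode can produce the same result without a Python loop. This is a minimal sketch under that assumption, reading the same ./json-on-one-line file:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import MapType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Model "database" as a map from table name to its metadata struct,
# so explode can turn each table entry into its own row.
schema = StructType([
    StructField("database", MapType(StringType(), StructType([
        StructField("load_type", StringType()),
        StructField("columns", MapType(StringType(), StringType())),
    ])))
])

df = spark.read.json("./json-on-one-line", schema=schema)

# explode on a map column yields one row per entry, as key/value columns
result = (
    df.select(F.explode("database").alias("table", "meta"))
      .select(
          "table",
          F.col("meta.load_type").alias("load_type"),
          # map_keys extracts the column names; concat_ws joins them
          F.concat_ws(", ", F.map_keys("meta.columns")).alias("ColumnAsString"),
      )
)
result.show()

Because each .first() call in the loop above triggers a separate Spark job, this explode variant also scales better when the file describes many tables.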