jsondataframeapache-sparkpysparkaws-glue

how concatenate a json key columns using coma as separator


i have a problem to resolve this:

Having a json file, like:

{
    "database": {
        "table_1": {
            "load_type": "Delta",
            "columns": {
                "col_t1_1": "NVARCHAR2",
                "col_t1_2": "NUMBER",
                "col_t1_3": "NVARCHAR2",
                "col_t1_4": "NVARCHAR2",
                "col_t1_5": "NVARCHAR2"
            }
        },
        "table_2": {
            "load_type": "Delta",
            "columns": {
                "col_t2_1": "NVARCHAR2",
                "col_t2_2": "NVARCHAR2",
                "col_t2_3": "NVARCHAR2",
                "col_t2_4": "NVARCHAR2"
            }
        }
    }
}

I try using explode to get the next result:

enter image description here

But is difficult and a little complex, is for that reasson i ask to someone who can help me to resolve this issue.

Best regards


Solution

  • Here's how you do it in scala you can likely translate to pyspark:

    Contents of json-on-one-line

    { "database": { "table_1": { "load_type": "Delta", "columns": { "col_t1_1": "NVARCHAR2", "col_t1_2": "NUMBER", "col_t1_3": "NVARCHAR2", "col_t1_4": "NVARCHAR2", "col_t1_5": "NVARCHAR2" } }, "table_2": { "load_type": "Delta", "columns": { "col_t2_1": "NVARCHAR2", "col_t2_2": "NVARCHAR2", "col_t2_3": "NVARCHAR2", "col_t2_4": "NVARCHAR2" } } } }
    

    Solution:

    
    
    # Read the JSON file
    jsonDF = spark.read.json("./json-on-one-line") # make it all on one line.
    
    # Extract all table names under "database"
    database_columns = jsonDF.selectExpr("database.*").columns
    
    allColumns = []
    for elem in database_columns:
         nested_columns = jsonDF.selectExpr("database."+elem+".columns.*").columns
         load_type = jsonDF.selectExpr("database."+elem+".load_type").first()[0]
         allColumns.append([elem, load_type,", ".join(nested_columns)])
    
    df_columns = spark.sparkContext.parallelize(allColumns).toDF(["table","load_type","ColumnAsString" ])
    
    
    df_columns.show()
    
    +-------+---------+--------------------+
    |  table|load_type|      ColumnAsString|
    +-------+---------+--------------------+
    |table_1|    Delta|col_t1_1, col_t1_...|
    |table_2|    Delta|col_t2_1, col_t2_...|
    +-------+---------+--------------------+