pythonazure-synapseazure-notebooks

How to build JSON hierarchy tree using python?


I am currently working with a CSV file generated from a SQL query. This file contains all the fields and calculations necessary for my analysis. My current task is to transform this data into a JSON file following a specific hierarchical structure, see below.

I'm using Synapse Spark notebooks to try and format the data to suit the JSON format required by a custom Function App. There is one foreign key relationship which is how we're linking the Data2 array but they are separate arrays.

The output for Data1 is formatted correctly but I'm having an issue with the formatting for Data2. Unfortunately I can't provide the source dataset.

The python I have so far:

df = spark.read.option('header', 'true') 
               .option('delimiter', ',') 
               .csv(read_path)

df2 = (
    df.groupBy("Data1Id")
    .agg(collect_set('PreviousData1').alias('PreviousData1'))
)

JSON format needed:


{
   "Data1":[
      {
         "Data1Id":"id1",
         "PreviousData1":[
            {
               "Id":"PreviousId",
               "PreviousShifts":[
               ]
            }
         ],
         "FutureData1":[
            {
               "Id":"futureId",
               "FutureData1":[
                  
               ],
               "PreviousData1":[
               ]
            },
         ]
      }
   ],
 "Data2":[
      {
         "Data2Id":"id2",
         "Function2":[
            {
               "FunctionName":"function",
               "Value":"3"
            },
         ]
      }
   ]
}

As explained above, I am trying to format my source data into JSON and am after some assistance with the best way to do this using Python. Thanks.


Solution

  • You can use the code below:

    from pyspark.sql import functions as F
    
    data1_df = (
        df.groupBy("Data1Id")
        .agg(
            F.collect_set(
                F.struct(
                    F.col("PreviousId").alias("Id"),
                    F.array().alias("PreviousShifts")
                )
            ).alias("PreviousData1"),
            F.collect_set(
                F.struct(
                    F.col("FutureId").alias("Id"),
                    F.array().alias("FutureData1"),
                    F.array().alias("PreviousData1")
                )
            ).alias("FutureData1")
        )
    )
    
    data2_df = (
        df.groupBy("Data2Id", "FunctionName")
        .agg(F.collect_list(F.struct(F.col("FunctionName"), F.col("Value"))).alias("Function2"))
    )
    
    import json
    result = {}
    result["Data1"] = [json.loads(data1_df.toJSON().collect()[0])]
    result["Data2"] = [json.loads(data2_df.toJSON().collect()[0])]
    result
    

    Output:

    {
        'Data1': [{
            'Data1Id': 'id1',
            'PreviousData1': [
                {'Id': 'PreviousId2', 'PreviousShifts': []},
                {'Id': 'PreviousId3', 'PreviousShifts': []},
                {'Id': 'PreviousId1', 'PreviousShifts': []}
            ],
            'FutureData1': [
                {'Id': 'FutureId1', 'FutureData1': [], 'PreviousData1': []},
                {'Id': 'FutureId2', 'FutureData1': [], 'PreviousData1': []}
            ]
        }],
        'Data2': [{
            'Data2Id': 'id2',
            'FunctionName': 'function1',
            'Function2': [{'FunctionName': 'function1', 'Value': 3}]
        }]
    }
    

    Enter image description here

    However, you may need to modify the code according to your data. If you provide a schema or sample data, it will be helpful for reproducing the issue.