pythonjsondataframeapache-sparkpyspark

which is the best way to convert json into a dataframe?


I have a question about the best way to convert this JSON to a Dataframe:

JSON data:

{
    "myschema": {
        "accounts": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "isdeleted": "number",
                "master": "nvarchar2",
                "name": "nvarchar2"
            }
        },
        "customer": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "accountid": "nvarchar2",
                "usergroupid": "nvarchar2"
            }
        },
        "resources": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "isdeleted": "number",
                "name": "nvarchar2",
                "currency": "nvarchar2"
            }
        },
        ....
        ....
    }
}

The result must be something like this:

TABLE       |LOAD_TYPE  |COLUMN           |COLUMN_TYPE   |
+-----------+-----------+-----------------+--------------+
| accounts  |daily      |id               |NVARCHAR2     |
| accounts  |daily      |master           |NVARCHAR2     |
| accounts  |daily      |name             |NVARCHAR2     |
| customer  |daily      |id               |NVARCHAR2     |
| customer  |daily      |accountid        |NVARCHAR2     |
| customer  |daily      |usergroupid      |NVARCHAR2     |
| resources |daily      |id               |NVARCHAR2     |
| resources |daily      |name             |NVARCHAR2     |
| resources |daily      |currency         |NVARCHAR2     |
+-----------+-----------+-----------------+--------------+

I tried the next code:

df2 = spark.read.option("multiLine", "true").json(json_s3_path)
df2.printSchema()

root
 |-- mySchema: struct (nullable = true)
 |    |-- accounts: struct (nullable = true)
 |    |    |-- FIELDS: struct (nullable = true)
.....
.....

and also the next code:

df3 = spark.read.format("json") \
          .option("multiLine", True) \
          .option("header",True) \
          .option("inferschema",True) \
          .load(json_s3_path) \

and the result for this is:

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|mySchema                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|{{{NVARCHAR2, NUMBER, NVARCHAR2, NVARCHAR2}, Delta}, {{NVARCHAR2, NVARCHAR2, NVARCHAR2}, Delta}, {{NVARCHAR2, NVARCHAR2, NUMBER, NVARCHAR2}, Delta}}|
+----------------------------------------------------------------------------------------------------------------------------------------------------+

Solution

  • You can convert a JSON object to a dataframe using pandas. Here's the code

    import pandas as pd
    
    json_data = {
        "myschema": {
            "accounts": {
                "load_type": "daily",
                "fields": {
                    "id": "nvarchar2",
                    "isdeleted": "number",
                    "master": "nvarchar2",
                    "name": "nvarchar2"
                }
            },
            "customer": {
                "load_type": "daily",
                "fields": {
                    "id": "nvarchar2",
                    "accountid": "nvarchar2",
                    "usergroupid": "nvarchar2"
                }
            },
            "resources": {
                "load_type": "daily",
                "fields": {
                    "id": "nvarchar2",
                    "isdeleted": "number",
                    "name": "nvarchar2",
                    "currency": "nvarchar2"
                }
            },
            ....
            ....
        }
    }
    
    # Extract the information and flatten it into a list of rows
    data = []
    
    for table_name, table_data in json_data["myschema"].items():
        load_type = table_data["load_type"]
        fields = table_data["fields"]
        for column_name, column_type in fields.items():
            data.append([table_name, load_type, column_name, column_type.upper()])
    
    # Create the DataFrame
    df = pd.DataFrame(data, columns=["TABLE", "LOAD_TYPE", "COLUMN", "COLUMN_TYPE"])
    
    print(df)