python json dataframe apache-spark pyspark

which is the best way to convert json into a dataframe?

I have a question about the best way to convert this JSON to a Dataframe:

JSON data:

{
    "myschema": {
        "accounts": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "isdeleted": "number",
                "master": "nvarchar2",
                "name": "nvarchar2"
            }
        },
        "customer": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "accountid": "nvarchar2",
                "usergroupid": "nvarchar2"
            }
        },
        "resources": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "isdeleted": "number",
                "name": "nvarchar2",
                "currency": "nvarchar2"
            }
        },
        ....
        ....
    }
}

The result must be something like this:

TABLE       |LOAD_TYPE  |COLUMN           |COLUMN_TYPE   |
+-----------+-----------+-----------------+--------------+
| accounts  |daily      |id               |NVARCHAR2     |
| accounts  |daily      |master           |NVARCHAR2     |
| accounts  |daily      |name             |NVARCHAR2     |
| customer  |daily      |id               |NVARCHAR2     |
| customer  |daily      |accountid        |NVARCHAR2     |
| customer  |daily      |usergroupid      |NVARCHAR2     |
| resources |daily      |id               |NVARCHAR2     |
| resources |daily      |name             |NVARCHAR2     |
| resources |daily      |currency         |NVARCHAR2     |
+-----------+-----------+-----------------+--------------+

I tried the next code:

df2 = spark.read.option("multiLine", "true").json(json_s3_path)
df2.printSchema()

root
 |-- mySchema: struct (nullable = true)
 |    |-- accounts: struct (nullable = true)
 |    |    |-- FIELDS: struct (nullable = true)
.....
.....

and also the next code:

df3 = spark.read.format("json") \
          .option("multiLine", True) \
          .option("header",True) \
          .option("inferschema",True) \
          .load(json_s3_path) \

and the result for this is:

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|mySchema                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|{{{NVARCHAR2, NUMBER, NVARCHAR2, NVARCHAR2}, Delta}, {{NVARCHAR2, NVARCHAR2, NVARCHAR2}, Delta}, {{NVARCHAR2, NVARCHAR2, NUMBER, NVARCHAR2}, Delta}}|
+----------------------------------------------------------------------------------------------------------------------------------------------------+

Solution

You can convert a JSON object to a dataframe using pandas. Here's the code

import pandas as pd

json_data = {
    "myschema": {
        "accounts": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "isdeleted": "number",
                "master": "nvarchar2",
                "name": "nvarchar2"
            }
        },
        "customer": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "accountid": "nvarchar2",
                "usergroupid": "nvarchar2"
            }
        },
        "resources": {
            "load_type": "daily",
            "fields": {
                "id": "nvarchar2",
                "isdeleted": "number",
                "name": "nvarchar2",
                "currency": "nvarchar2"
            }
        },
        ....
        ....
    }
}

# Extract the information and flatten it into a list of rows
data = []

for table_name, table_data in json_data["myschema"].items():
    load_type = table_data["load_type"]
    fields = table_data["fields"]
    for column_name, column_type in fields.items():
        data.append([table_name, load_type, column_name, column_type.upper()])

# Create the DataFrame
df = pd.DataFrame(data, columns=["TABLE", "LOAD_TYPE", "COLUMN", "COLUMN_TYPE"])

print(df)