I have a question about the best way to convert this JSON to a Dataframe:
JSON data:
{
"myschema": {
"accounts": {
"load_type": "daily",
"fields": {
"id": "nvarchar2",
"isdeleted": "number",
"master": "nvarchar2",
"name": "nvarchar2"
}
},
"customer": {
"load_type": "daily",
"fields": {
"id": "nvarchar2",
"accountid": "nvarchar2",
"usergroupid": "nvarchar2"
}
},
"resources": {
"load_type": "daily",
"fields": {
"id": "nvarchar2",
"isdeleted": "number",
"name": "nvarchar2",
"currency": "nvarchar2"
}
},
....
....
}
}
The result must be something like this:
TABLE |LOAD_TYPE |COLUMN |COLUMN_TYPE |
+-----------+-----------+-----------------+--------------+
| accounts |daily |id |NVARCHAR2 |
| accounts |daily |master |NVARCHAR2 |
| accounts |daily |name |NVARCHAR2 |
| customer |daily |id |NVARCHAR2 |
| customer |daily |accountid |NVARCHAR2 |
| customer |daily |usergroupid |NVARCHAR2 |
| resources |daily |id |NVARCHAR2 |
| resources |daily |name |NVARCHAR2 |
| resources |daily |currency |NVARCHAR2 |
+-----------+-----------+-----------------+--------------+
I tried the next code:
df2 = spark.read.option("multiLine", "true").json(json_s3_path)
df2.printSchema()
root
|-- mySchema: struct (nullable = true)
| |-- accounts: struct (nullable = true)
| | |-- FIELDS: struct (nullable = true)
.....
.....
and also the next code:
df3 = spark.read.format("json") \
.option("multiLine", True) \
.option("header",True) \
.option("inferschema",True) \
.load(json_s3_path) \
and the result for this is:
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|mySchema |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|{{{NVARCHAR2, NUMBER, NVARCHAR2, NVARCHAR2}, Delta}, {{NVARCHAR2, NVARCHAR2, NVARCHAR2}, Delta}, {{NVARCHAR2, NVARCHAR2, NUMBER, NVARCHAR2}, Delta}}|
+----------------------------------------------------------------------------------------------------------------------------------------------------+
You can convert a JSON object to a dataframe using pandas
.
Here's the code
import pandas as pd
json_data = {
"myschema": {
"accounts": {
"load_type": "daily",
"fields": {
"id": "nvarchar2",
"isdeleted": "number",
"master": "nvarchar2",
"name": "nvarchar2"
}
},
"customer": {
"load_type": "daily",
"fields": {
"id": "nvarchar2",
"accountid": "nvarchar2",
"usergroupid": "nvarchar2"
}
},
"resources": {
"load_type": "daily",
"fields": {
"id": "nvarchar2",
"isdeleted": "number",
"name": "nvarchar2",
"currency": "nvarchar2"
}
},
....
....
}
}
# Extract the information and flatten it into a list of rows
data = []
for table_name, table_data in json_data["myschema"].items():
load_type = table_data["load_type"]
fields = table_data["fields"]
for column_name, column_type in fields.items():
data.append([table_name, load_type, column_name, column_type.upper()])
# Create the DataFrame
df = pd.DataFrame(data, columns=["TABLE", "LOAD_TYPE", "COLUMN", "COLUMN_TYPE"])
print(df)