I'm trying to flatten JSON arrays of objects (not .json files) inside Dask dataframes. I have a lot of data and my RAM is constantly consumed by the running processes, so I need a parallel solution.
This is the JSON I have:
[
  {
    "id": "0001",
    "name": "Stiven",
    "location": [
      { "country": "Colombia", "department": "Choco", "city": "Quibdo" },
      { "country": "Colombia", "department": "Antioquia", "city": "Medellin" },
      { "country": "Colombia", "department": "Cundinamarca", "city": "Bogota" }
    ]
  },
  {
    "id": "0002",
    "name": "Jhon Jaime",
    "location": [
      { "country": "Colombia", "department": "Valle del Cauca", "city": "Cali" },
      { "country": "Colombia", "department": "Putumayo", "city": "Mocoa" },
      { "country": "Colombia", "department": "Arauca", "city": "Arauca" }
    ]
  },
  {
    "id": "0003",
    "name": "Francisco",
    "location": [
      { "country": "Colombia", "department": "Atlantico", "city": "Barranquilla" },
      { "country": "Colombia", "department": "Bolivar", "city": "Cartagena" },
      { "country": "Colombia", "department": "La Guajira", "city": "Riohacha" }
    ]
  }
]
This is the dataframe I have:
index id name location
0 0001 Stiven [{'country':'Colombia', 'department': 'Choco', 'city': 'Quibdo'}, {'country':'Colombia', 'department': 'Antioquia', 'city': 'Medellin'}, {'country':'Colombia', 'department': 'Cundinamarca', 'city': 'Bogota'}]
1 0002 Jhon Jaime [{'country':'Colombia', 'department': 'Valle del Cauca', 'city': 'Cali'}, {'country':'Colombia', 'department': 'Putumayo', 'city': 'Mocoa'}, {'country':'Colombia', 'department': 'Arauca', 'city': 'Arauca'}]
2 0003 Francisco [{'country':'Colombia', 'department': 'Atlantico', 'city': 'Barranquilla'}, {'country':'Colombia', 'department': 'Bolivar', 'city': 'Cartagena'}, {'country':'Colombia', 'department': 'La Guajira', 'city': 'Riohacha'}]
I need to flatten it into a dataframe with one row per location, something like this:
index id name country department city
0 0001 Stiven Colombia Choco Quibdo
1 0001 Stiven Colombia Antioquia Medellin
2 0001 Stiven Colombia Cundinamarca Bogota
3 0002 Jhon Jaime Colombia Valle del Cauca Cali
4 0002 Jhon Jaime Colombia Putumayo Mocoa
5 0002 Jhon Jaime Colombia Arauca Arauca
6 0003 Francisco Colombia Atlantico Barranquilla
7 0003 Francisco Colombia Bolivar Cartagena
8 0003 Francisco Colombia La Guajira Riohacha
The whole process must run in parallel with Dask. Any recommendations?
Thanks in advance.
I recommend solving this problem first with a plain Pandas dataframe, and then using the .map_partitions
method to apply that same function to every Pandas partition within the Dask dataframe.