I have JSON data (loaded as a Python dict) that looks like this:
test = {'kpiData': [{'date': '2020-06-03 10:05',
                     'a': 'MINIMUMINTERVAL',
                     'b': 0.0,
                     'c': True},
                    {'date': '2020-06-03 10:10',
                     'a': 'MINIMUMINTERVAL',
                     'b': 0.0,
                     'c': True},
                    {'date': '2020-06-03 10:15',
                     'a': 'MINIMUMINTERVAL',
                     'b': 0.0,
                     'c': True},
                    {'date': '2020-06-03 10:20',
                     'a': 'MINIMUMINTERVAL',
                     'b': 0.0,}
                    ]}
I want to convert it to a dataframe object, like this:
rdd = sc.parallelize([test])
jsonDF = spark.read.json(rdd)
This results in a corrupted record. From my understanding, the reason is that True and False are not valid JSON literals (JSON uses lowercase true and false). So I need to transform these entries prior to spark.read.json() (to true or "True"). test is a dict and rdd is a pyspark.rdd.RDD object. For a dataframe object the transformation is pretty straightforward, but I didn't find a solution for these objects.
spark.read.json expects an RDD of JSON strings, not an RDD of Python dictionaries. If you convert the dictionary to a JSON string, you should be able to read it into a dataframe:
import json
df = spark.read.json(sc.parallelize([json.dumps(test)]))
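Note that json.dumps already serializes Python booleans to JSON's lowercase true/false, so no manual replacement of True/False is needed. A minimal sketch of just the serialization step (plain Python, no Spark session required):

```python
import json

# A trimmed-down version of the dict from the question
test = {'kpiData': [{'date': '2020-06-03 10:05',
                     'a': 'MINIMUMINTERVAL',
                     'b': 0.0,
                     'c': True}]}

# json.dumps emits valid JSON: Python's True becomes JSON's true
s = json.dumps(test)
print(s)
print('true' in s)  # True
```

Each element of the parallelized list is then a valid JSON string that spark.read.json can parse without producing a _corrupt_record column.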
Another possible way is to read the dictionary in directly using spark.createDataFrame:
df = spark.createDataFrame([test])
which will give a different schema, with maps instead of structs.