I am working in Databricks, where I read a JSON file from S3 and need to convert it to a string. The JSON file has multiple layers to it. For the demo, let's say it looks like this:
{
  "id": "123",
  "details": [
    {
      "name": "Bob",
      "address": "123 street"
    },
    {
      "name": "Amy",
      "address": "XYZ street"
    }
  ],
  "docType": "File",
  "collections": ["a","b","c"]
}
The JSON file is stored in S3, and I read it like so:
aws_s3_bucket = 'my_bucket'
mount_name = '/mnt/test'
source_url = 's3a://%s' % aws_s3_bucket

# Note: mounting fails if mount_name is already mounted.
dbutils.fs.mount(source_url, mount_name)

file_path = "/dummyKey/dummyFile.json"
# multiline is required because the document spans several lines;
# without it, Spark parses each line as a separate JSON record.
df = spark.read.option("multiline", "true").json(mount_name + file_path).cache()
This returns a DataFrame with four columns: id, details, docType, and collections, i.e. one column for each top-level field in the JSON file.
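For reference, the inferred schema should look roughly like this (a sketch; Spark sorts inferred JSON fields alphabetically, so the column order may differ from the file):

df.printSchema()
# root
#  |-- collections: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- details: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- address: string (nullable = true)
#  |    |    |-- name: string (nullable = true)
#  |-- docType: string (nullable = true)
#  |-- id: string (nullable = true)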
I would now like to convert this DataFrame into a JSON string, i.e. the contents of the file in S3 but without the formatting.
I tried to make use of the DataFrame's toJSON() function, as well as to_json from the pandas library, but in both cases I got the following error:
the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
which, from what I understand, means that Spark could not actually parse the records and everything ended up in the internal _corrupt_record column.
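For concreteness, the attempts looked roughly like this (a sketch, not the exact code):

json_rows = df.toJSON().collect()    # Spark's per-row JSON export
json_str = df.toPandas().to_json()   # pandas' to_json after conversion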
My question is then as follows: how can I read a JSON file from S3 and convert it to a string? I suspect there is a way to read it in directly as a string, without going through a DataFrame, but I just cannot find such a function (provided it exists to begin with).
So, after some more research, I didn't find anything that achieves this directly. Instead, I resorted to reading the JSON file as text and concatenating the rows of the resulting DataFrame to produce the JSON string. The code to achieve that is as follows:
def create_json_str(collection):
    # Concatenate the collected text rows back into a single string.
    # Stripping each line (rather than deleting every space) removes the
    # indentation without mangling values like "123 street".
    output_str = ""
    for row in collection:
        output_str += row[0].strip()
    return output_str
aws_s3_bucket = 'my_bucket'
mount_name = '/mnt/test'
source_url = 's3a://%s' % aws_s3_bucket
# (Skip the mount if the bucket is still mounted from the step above.)
dbutils.fs.mount(source_url, mount_name)

file_path = "/dummyKey/dummyFile.json"
# Read the file as plain text: one DataFrame row per line of the file,
# in a single string column named "value".
df = spark.read.text(mount_name + file_path).cache()
df_collection = df.collect()
json_str = create_json_str(df_collection)
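If stricter minification is needed, the concatenated text can be round-tripped through Python's json module, which parses it and re-serializes it without any whitespace (a small sketch building on json_str from above):

import json

# separators=(",", ":") drops the spaces json.dumps inserts by default,
# producing a fully minified JSON string.
minified_str = json.dumps(json.loads(json_str), separators=(",", ":"))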
Not the best solution; I think defining a schema and reading the file as JSON would be better, but for my purposes it is enough.
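For reference, here is a sketch of that schema-based approach, assuming the same mount point and file path as above (untested, so treat it as a starting point rather than a definitive implementation):

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Explicit schema matching the sample document, so Spark never has to
# infer types (and never falls back to _corrupt_record).
schema = StructType([
    StructField("id", StringType()),
    StructField("details", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("address", StringType()),
    ]))),
    StructField("docType", StringType()),
    StructField("collections", ArrayType(StringType())),
])

df = (spark.read
      .schema(schema)
      .option("multiline", "true")
      .json(mount_name + file_path))

# toJSON() yields one JSON string per DataFrame row; this file is a
# single top-level document, so the first row is the whole thing.
json_str = df.toJSON().first()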