Tags: json, pandas, pyspark, databricks

How to read a complex JSON file and convert it to a string?


I am working in Databricks, where I read a JSON file from S3 and need to convert it to a string. The JSON file has multiple nested layers. For the demo, let's say it looks like this:

{
  "id": "123",
  "details":[
    {
      "name": "Bob",
      "address": "123 street"
    },
    {
      "name": "Amy",
      "address": "XYZ street"
    }
  ],
  "docType": "File",
  "collections": ["a","b","c"]
}

Said JSON file is stored in S3, so I read it like this:

aws_s3_bucket = 'my_bucket'
mount_name = '/mnt/test'

# Mount the bucket so its contents are reachable under /mnt/test
source_url = 's3a://%s' % aws_s3_bucket
dbutils.fs.mount(source_url, mount_name)

# Read the multiline JSON file into a DataFrame
file_path = "/dummyKey/dummyFile.json"
df = spark.read.option("multiline", "true").json(mount_name + file_path).cache()

This returns a DataFrame with 4 columns: id, details, docType and collections, i.e. one column for each top-level field in my JSON file.
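For reference, the inferred schema looks roughly like the following (a sketch; Spark orders inferred JSON fields alphabetically, and exact nullability may differ):

df.printSchema()
# root
#  |-- collections: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- details: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- address: string (nullable = true)
#  |    |    |-- name: string (nullable = true)
#  |-- docType: string (nullable = true)
#  |-- id: string (nullable = true)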

I would now like to convert this DataFrame into a JSON string, i.e. the file that is in S3 but without formatting.

I tried to make use of the toJSON() function, as well as to_json from the pandas library. But in both cases I got the following error:

the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column

which, from what I understand, means that Spark failed to parse the records, so the query only ends up referencing the internal corrupt record column; in other words, it cannot do this because the file is too complex for how it was read.
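For completeness, the attempts looked roughly like this (a sketch; the exact calls I used may have differed slightly, but both routes trigger the error above):

# Spark route: serialize each row back to a JSON string
json_rows = df.toJSON().collect()

# pandas route: convert to pandas first, then serialize
pdf = df.toPandas()
json_str = pdf.to_json(orient='records')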

My question is then as follows: how can I read a JSON file from S3 and convert it to a string? I suspect there is a way to read it in directly as a string without saving it to a DataFrame, but I just cannot find such a function (provided it exists to begin with).


Solution

  • So, after some more research I didn't find anything that achieves this directly. Instead, I resorted to reading the JSON file as text and concatenating the rows of the resulting DataFrame to produce the JSON string. The code to achieve that is as follows:

    def create_json_str(collection):
        # Concatenate the lines, stripping only leading/trailing whitespace
        # from each one. (Stripping *all* spaces would corrupt values such
        # as "123 street".)
        output_str = ""
        for row in collection:
            output_str = output_str + row[0].strip()
        return output_str

    aws_s3_bucket = 'my_bucket'
    mount_name = '/mnt/test'

    source_url = 's3a://%s' % aws_s3_bucket
    dbutils.fs.mount(source_url, mount_name)

    # Read the file as plain text: one DataFrame row per line. The
    # "multiline" option only applies to the JSON reader, so it is omitted.
    file_path = "/dummyKey/dummyFile.json"
    df = spark.read.text(mount_name + file_path).cache()

    df_collection = df.collect()
    json_str = create_json_str(df_collection)
    

    Not the best solution; I think defining a schema and reading the file as JSON would be better, but for my purposes it is enough.
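    To confirm the concatenated string is still valid JSON, it can be parsed back with the standard library (a quick check assuming the demo file above):

    import json

    # json.loads raises an error if the string is malformed
    parsed = json.loads(json_str)
    assert parsed["id"] == "123"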