sql amazon-s3 databricks amazon-s3-select

Empty column not being listed when querying a JSON file in S3 from Databricks


I'm querying a JSON file in S3 with multiple columns:

SELECT a, b, c FROM json.`s3://my-bucket/file.json.gz`

And the file looks like this:

{"a": {}, "b": 0, "c": 1}
{"a": {}, "b": 1, "c": 2}
{"a": {}, "b": 2, "c": 3}

The query above fails and returns

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `a` cannot be resolved. Did you mean one of the following? [`b`, `c`]

And when I perform

SELECT * FROM json.`s3://my-bucket/file.json.gz`

I get only the columns b and c.

Is there a way to also get column a, and to see that it contains an empty JSON object?


Solution

  • Can you use Python or Scala syntax?
    You need to impose a schema on the JSON file when reading it. Column a only ever contains an empty object, so Spark's schema inference has nothing to infer its type from and drops it; as far as I know, supplying a schema is not possible through SQL queries alone.
    The solution using Python syntax would look like this:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    
    # Column a holds an empty JSON object, so its intended type is unclear;
    # reading it as StringType keeps the raw JSON text, so you can see the
    # empty object. Swap in the correct type if you know what a should hold.
    schema = StructType([
      StructField('a', StringType(), True),
      StructField('b', IntegerType(), True),
      StructField('c', IntegerType(), True),
    ])
    
    df = spark.read.schema(schema).json('s3://my-bucket/file.json.gz')
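    End to end, this might look like the sketch below. It uses a local temp file standing in for s3://my-bucket/file.json.gz (so it can run without S3 credentials), and the view name my_file is an assumption; registering the DataFrame as a temp view lets you run the original SQL query against it.

    ```python
    import os
    import tempfile
    
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    
    spark = SparkSession.builder.master("local[1]").appName("empty-json-column").getOrCreate()
    
    # Local stand-in for s3://my-bucket/file.json.gz (hypothetical path).
    path = os.path.join(tempfile.mkdtemp(), "file.json")
    with open(path, "w") as f:
        f.write('{"a": {}, "b": 0, "c": 1}\n'
                '{"a": {}, "b": 1, "c": 2}\n'
                '{"a": {}, "b": 2, "c": 3}\n')
    
    # Explicit schema: column a is read as a string, preserving the raw
    # JSON text instead of being dropped by schema inference.
    schema = StructType([
        StructField("a", StringType(), True),
        StructField("b", IntegerType(), True),
        StructField("c", IntegerType(), True),
    ])
    df = spark.read.schema(schema).json(path)
    
    # Register a temp view so the original SQL query works unchanged.
    df.createOrReplaceTempView("my_file")
    spark.sql("SELECT a, b, c FROM my_file").show()
    ```

    With this in place, column a appears in both the DataFrame and SQL results as a string holding the empty object, rather than disappearing from the schema.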