sql amazon-s3 databricks amazon-s3-select

Empty column not being listed when querying a JSON file in S3 from Databricks


I'm querying a JSON file in S3 with multiple columns:

SELECT a, b, c FROM json.`s3://my-bucket/file.json.gz`

And the file looks like this:

{"a": {}, "b": 0, "c": 1}
{"a": {}, "b": 1, "c": 2}
{"a": {}, "b": 2, "c": 3}

The query above fails and returns

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `a` cannot be resolved. Did you mean one of the following? [`b`, `c`]

And when I perform

SELECT * FROM json.`s3://my-bucket/file.json.gz`

I get only the columns b and c.

Is there a way to also get column a, and to see that it contains an empty JSON object?


Solution

  • Can you use Python or Scala syntax?
    You need to impose a schema on the JSON file when reading it. Column a only ever contains an empty object, so Spark's schema inference has nothing to infer its type from and drops it; as far as I know, supplying a schema is not possible through SQL queries alone.
    The solution using Python syntax would look like this:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    
    # Column a holds an empty JSON object, so its intended type is unclear;
    # reading it as StringType keeps the raw JSON text, so you can see the
    # empty object. Swap in the correct type if you know what a should hold.
    schema = StructType([
      StructField('a', StringType(), True),
      StructField('b', IntegerType(), True),
      StructField('c', IntegerType(), True),
    ])
    
    df = spark.read.schema(schema).json('s3://my-bucket/file.json.gz')
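    End to end, this might look like the sketch below. It uses a local temp file standing in for s3://my-bucket/file.json.gz (so it can run without S3 credentials), and the view name my_file is an assumption; registering the DataFrame as a temp view lets you run the original SQL query against it.

    ```python
    import os
    import tempfile
    
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    
    spark = SparkSession.builder.master("local[1]").appName("empty-json-column").getOrCreate()
    
    # Local stand-in for s3://my-bucket/file.json.gz (hypothetical path).
    path = os.path.join(tempfile.mkdtemp(), "file.json")
    with open(path, "w") as f:
        f.write('{"a": {}, "b": 0, "c": 1}\n'
                '{"a": {}, "b": 1, "c": 2}\n'
                '{"a": {}, "b": 2, "c": 3}\n')
    
    # Explicit schema: column a is read as a string, preserving the raw
    # JSON text instead of being dropped by schema inference.
    schema = StructType([
        StructField("a", StringType(), True),
        StructField("b", IntegerType(), True),
        StructField("c", IntegerType(), True),
    ])
    df = spark.read.schema(schema).json(path)
    
    # Register a temp view so the original SQL query works unchanged.
    df.createOrReplaceTempView("my_file")
    spark.sql("SELECT a, b, c FROM my_file").show()
    ```

    With this in place, column a appears in both the DataFrame and SQL results as a string holding the empty object, rather than disappearing from the schema.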