python-3.x, pyspark, hl7-fhir

Safely accessing a non-existent nested JSON attribute in PySpark


I'm trying to create a column in PySpark with the code below, after reading in from a JSON file:

(observation_df
    .withColumn("contained_observations", F.explode(col("contained")))
    .withColumn("code", col("contained_observations.code"))
    .withColumn("code_text", col("code.text"))
    .withColumn("coding", when(col("code").isNotNull(), col("code").getField("coding")).otherwise(None))
    .select(
        col("code"),
        col("code_text"),
        # col("coding"),
    )
    .printSchema())

The field "coding" does not exist inside the "code" struct. However, even with the getField() check in place, it still raises the error below:

AnalysisException: No such struct field coding in text

How can I include the column in my DataFrame with a None (null) value even when it is not present in the input?

I also tried both of the variants below:

.withColumn("coding", when(col("code").isNotNull(), col("code").getField("coding")).otherwise(None))
.withColumn("coding",  col("code").getField("coding").isNotNull())

While reading in the JSON I have not provided a schema, because the schema is not fixed and is not known beforehand, so Spark infers it.

The inferred schema is root -> code (struct) -> text (string).


Solution

  • The solution is to explicitly check the struct's fields in the schema before referencing the nested column. The when()/otherwise() guard cannot help here: Spark resolves every column expression against the schema at analysis time, before any row is evaluated, so getField("coding") fails even inside a conditional. Checking the schema in Python side-steps the analyzer:

    from pyspark.sql.functions import col, lit

    # Inspect the inferred schema rather than the data itself.
    if "coding" in [x.name for x in observation_df.schema["code"].dataType.fields]:
        observation_df = observation_df.withColumn("coding", col("code").getField("coding"))
    else:
        # Field absent from the input: add it as a null column instead.
        observation_df = observation_df.withColumn("coding", lit(None))
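The same check can be generalized to arbitrarily deep paths by walking the schema as plain Python data. The sketch below is a hypothetical helper (not part of PySpark) that operates on the dict returned by `df.schema.jsonValue()`, so it can be tested without a running Spark session:

```python
def has_nested_field(schema_json, dotted_path):
    """Return True if the dotted field path exists in a Spark schema,
    given the schema as the dict produced by df.schema.jsonValue()."""
    node = schema_json
    for part in dotted_path.split("."):
        # Only struct nodes can contain named fields.
        if not isinstance(node, dict) or node.get("type") != "struct":
            return False
        match = next((f for f in node.get("fields", []) if f["name"] == part), None)
        if match is None:
            return False
        node = match["type"]  # a dict for nested structs, a plain string for leaf types
    return True


# The shape jsonValue() produces for the question's schema: root -> code -> text
inferred = {
    "type": "struct",
    "fields": [
        {"name": "code", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "text", "nullable": True, "metadata": {}, "type": "string"},
            ],
        }},
    ],
}

print(has_nested_field(inferred, "code.text"))    # True
print(has_nested_field(inferred, "code.coding"))  # False
```

With this helper, `has_nested_field(observation_df.schema.jsonValue(), "code.coding")` would decide whether to select the real nested column or fall back to `lit(None)`.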