I'm trying to create a column in PySpark using the code below, after reading the data in from a JSON file:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when

observation_df.withColumn("contained_observations", F.explode(col("contained"))) \
    .withColumn("code", col("contained_observations.code")) \
    .withColumn("code_text", col("code.text")) \
    .withColumn("coding", when(col("code").isNotNull(), col("code").getField("coding")).otherwise(None)) \
    .select(
        col("code"),
        col("code_text")
        # col("coding")
    ) \
    .printSchema()
The field "coding" does not exist inside the "code struct" However even after including getField() check it still gives below error
AnalysisException: No such struct field coding in text
How can I just include it in my DataFrame as a None value even if it's not present in the input?
I tried both of the versions below as well:
.withColumn("coding", when(col("code").isNotNull(), col("code").getField("coding")).otherwise(None))
.withColumn("coding", col("code").getField("coding").isNotNull())
While reading in the JSON I have not provided a schema, since the schema is not fixed and not known beforehand, so Spark is inferring it.
The inferred schema is root -> code (struct) -> text (string).
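For context, printing the inferred schema shows only the fields Spark actually saw in the input, so roughly the following (a sketch based on the description above; the exact output depends on the input JSON):

observation_df.printSchema()
# root
#  |-- code: struct (nullable = true)
#  |    |-- text: string (nullable = true)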
Neither of those attempts can work, because Spark resolves col("code").getField("coding") against the schema at analysis time, before when() is ever evaluated, so the AnalysisException is raised regardless. The solution is to explicitly check the fields attribute of the struct's type, as below:
if "coding" in [x.name for x in observation_df.schema["code"].dataType.fields]:
observation_df = observation_df.withColumn("coding", col("code").getField("coding"))
else:
observation_df = observation_df.withColumn("coding", lit(None))
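If several optional nested fields need the same treatment, the check can be wrapped in a small helper. This is only a minimal sketch of the same idea (the name with_optional_field is mine, not a library function), handling a single level of nesting:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType

def with_optional_field(df, struct_col, field_name, out_col):
    # Pull struct_col.field_name into out_col if the field was inferred,
    # otherwise add out_col as a NULL column so downstream code can rely on it existing.
    struct_type = df.schema[struct_col].dataType
    has_field = isinstance(struct_type, StructType) and field_name in struct_type.fieldNames()
    if has_field:
        return df.withColumn(out_col, F.col(struct_col).getField(field_name))
    return df.withColumn(out_col, F.lit(None))

observation_df = with_optional_field(observation_df, "code", "coding", "coding")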