pyspark · spark-streaming · azure-databricks

Access specific element in array in a string of JSON format


I have some streaming data, that can minimally be reduced like so:

{
    "data":[
        {
            "key":1,
            "val":"a"
        },
        {
            "key":2,
            "val":"b",
            "test":"bla"
        }
    ]
}

from which I need to access the "data" array, which arrives as a JSON-formatted string. More specifically, I need to find the "val" field of the element where "key" == 2.

So far I have tried:


Solution

  • One approach is to convert the stringified JSON to an array of structs, then extract the value you want.

    Even though the structure varies between elements, if the schema is somewhat stable you can build one that covers the fields you want to extract.

    For example, the schema below will parse the test field when it exists and yield NULL when it doesn't (as when key = 1). Conversely, if you are not interested in the test field, omit its StructField and the field is simply ignored.

    from pyspark.sql.types import (
        StructType, StructField, ArrayType, IntegerType, StringType
    )

    schema = StructType([
        StructField('data', ArrayType(StructType([
            StructField('key', IntegerType()),
            StructField('val', StringType()),
            StructField('test', StringType()),
            # add more fields that you are interested in
        ])))
    ])
    

    Use this schema in from_json, then extract the field you want.

    from pyspark.sql import functions as F

    # Parse the JSON string, then keep only the element with key == 2.
    # F.filter with a Python lambda requires Spark 3.1+.
    df = (df.withColumn('data', F.from_json('data', schema))
          .withColumn('data', F.filter(F.col('data').data, lambda x: x.key == 2)[0].val))
    

    If no element has key = 2, filter returns an empty array, indexing it with [0] yields NULL, and you get NULL instead of a hard crash.
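    The steps above can be sketched end to end on a batch DataFrame (the same transformations apply unchanged to a streaming one). This is a minimal, illustrative example using the sample payload from the question; the local SparkSession setup is only for demonstration.

    ```python
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import (
        StructType, StructField, ArrayType, IntegerType, StringType
    )

    spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

    # The sample payload from the question, as a single string column named 'data'.
    payload = '{"data":[{"key":1,"val":"a"},{"key":2,"val":"b","test":"bla"}]}'
    df = spark.createDataFrame([(payload,)], ["data"])

    schema = StructType([
        StructField("data", ArrayType(StructType([
            StructField("key", IntegerType()),
            StructField("val", StringType()),
            StructField("test", StringType()),
        ])))
    ])

    # Parse the JSON, keep the element with key == 2, take its 'val' field.
    result = (df.withColumn("data", F.from_json("data", schema))
                .withColumn("data", F.filter(F.col("data").data,
                                             lambda x: x.key == 2)[0].val))

    result.show()  # the 'data' column now holds "b"
    ```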