pythonpysparklambda

Extract specific dictionary value from dataframe in PySpark having case insensitive attributes


I have a below dataframe

dataDictionary = [('value1', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'Colour', 'value': 'White'}]), 
                 ('value2', [{'key': 'Fruit', 'value': 'Mango'}, {'key': 'Bird', 'value': 'Eagle'}, {'key': 'Colour', 'value': 'Black'}]),
('value3', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'colour', 'value': 'Blue'}])] 

df = spark.createDataFrame(data=dataDictionary)
df.printSchema()
df.show(truncate=False)
+------+------------------------------------------------------------------------------------------------+
|_1    |_2                                                                                              |
+------+------------------------------------------------------------------------------------------------+
|value1|[{value -> Apple, key -> Fruit}, {value -> White, key -> Colour}]                               |
|value2|[{value -> Mango, key -> Fruit}, {value -> Eagle, key -> Bird}, {value -> Black, key -> Colour}]|
|value3|[{value -> Apple, key -> Fruit}, {value -> Blue, key -> colour}]                               
+------+------------------------------------------------------------------------------------------------+

I wanted to extract only the values of key -> Colour and I'm using below to get the exact result

from pyspark.sql import SparkSession, functions as F

...
df = df.select('_1', F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value'])

result,

_1       _2
value1   White
value2   Black
value3

But for value3, there is no result because key is in lower case colour, for value1 and vaue2 key is in camel case Colour which works with the lambda function F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value']. I tried using upper to handle all three scenarios, but it's not working.

F.filter('_2', lambda x: x['key'].upper() == 'COLOUR')[0]['value']

Any suggestion would be appreciated.


Solution

  • You need to use spark built-in upper function to convert the key to upper case.

    df = df.select('_1', F.filter('_2', lambda x: F.upper(x['key']) == 'COLOUR')[0]['value'])