I have the dataframe below:
dataDictionary = [('value1', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'Colour', 'value': 'White'}]),
('value2', [{'key': 'Fruit', 'value': 'Mango'}, {'key': 'Bird', 'value': 'Eagle'}, {'key': 'Colour', 'value': 'Black'}]),
('value3', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'colour', 'value': 'Blue'}])]
df = spark.createDataFrame(data=dataDictionary)
df.printSchema()
df.show(truncate=False)
+------+------------------------------------------------------------------------------------------------+
|_1 |_2 |
+------+------------------------------------------------------------------------------------------------+
|value1|[{value -> Apple, key -> Fruit}, {value -> White, key -> Colour}] |
|value2|[{value -> Mango, key -> Fruit}, {value -> Eagle, key -> Bird}, {value -> Black, key -> Colour}]|
|value3|[{value -> Apple, key -> Fruit}, {value -> Blue, key -> colour}]                                |
+------+------------------------------------------------------------------------------------------------+
I want to extract only the values for `key -> Colour`, and I'm using the following to get that result:
from pyspark.sql import SparkSession, functions as F
...
df = df.select('_1', F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value'])
Result:
_1      _2
value1  White
value2  Black
value3
But for value3 there is no result, because its key is lower-case (colour), while for value1 and value2 the key is capitalised (Colour), which is what the lambda in F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value'] matches. I tried using upper to handle all three cases, but it's not working:
F.filter('_2', lambda x: x['key'].upper() == 'COLOUR')[0]['value']
Any suggestion would be appreciated.
You need to use Spark's built-in upper function to convert the key to upper case. Inside the lambda, x['key'] is a Column, not a Python string, so calling str.upper() on it fails; F.upper builds a column expression that Spark evaluates per element.
df = df.select('_1', F.filter('_2', lambda x: F.upper(x['key']) == 'COLOUR')[0]['value'])