I have a problem in PySpark: creating a column in a new DataFrame based on the values of another column. It is tedious, and it seems like bad practice, to write a long chain of
CASE
WHEN column_a = 'value_1' THEN 'value_x'
WHEN column_a = 'value_2' THEN 'value_y'
...
WHEN column_a = 'value_289' THEN 'value_xwerwz'
END
The problem, as I see it, is that we are not handling a single row but all of them in one command, so a plain Python dict/map/configparser lookup is not an option. I thought about looping over a dict, but that seems slow and a waste of computation, since we repeat all the conditions.
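For context, here is a sketch of roughly how this looks on the PySpark side, assuming the CASE text is passed through F.expr (the column and value names are illustrative, not a real schema):
import pyspark.sql.functions as F

# One long, hand-written CASE expression; every new mapping means editing this string.
df = df.withColumn(
    "column_b",
    F.expr(
        "CASE WHEN column_a = 'value_1' THEN 'value_x' "
        "WHEN column_a = 'value_2' THEN 'value_y' "
        "ELSE NULL END"
    ),
)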
I'm still looking for a better practice; if I find one, I'll post it here. Quite possibly a lot of people already use one and I just don't know it yet. But if there is no other way, fine; writing many WHEN ... THEN conditions just wouldn't be my first choice.
Thank you
I tried using a dict and searched for solutions along these lines.
You could create a function that converts a dict into a chained Spark F.when expression, e.g.:
import pyspark.sql.functions as F

def create_spark_when(column, conditions, default):
    # Build one chained CASE WHEN expression from a {match: result} dict.
    # The loop runs on the driver and only assembles the expression,
    # so it adds no per-row Python cost.
    when = None
    for key, value in conditions.items():
        current_when = F.when(F.col(column) == key, value)
        if when is None:
            # First condition: attach the default as the final fallback.
            when = current_when.otherwise(default)
        else:
            # Nest everything built so far as this condition's fallback.
            when = current_when.otherwise(when)
    return when
df = spark.createDataFrame([(0,), (1,), (2,)])  # single column, auto-named "_1"
df.show()
my_conditions = {1: "a", 2: "b"}
my_default = "c"
df.withColumn(
    "my_column",
    create_spark_when("_1", my_conditions, my_default),
).show()
Output:
+---+
| _1|
+---+
| 0|
| 1|
| 2|
+---+
+---+---------+
| _1|my_column|
+---+---------+
| 0| c|
| 1| a|
| 2| b|
+---+---------+
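As a side note, if you prefer to avoid the when chain entirely, here is an alternative sketch using a map literal built with F.create_map, with F.coalesce supplying the default. It assumes the same df, my_conditions, and my_default as above, and the default (non-ANSI) behavior where a missing map key yields NULL:
from itertools import chain
import pyspark.sql.functions as F

# Flatten the dict into key1, value1, key2, value2, ... literals for create_map.
mapping = F.create_map([F.lit(x) for x in chain(*my_conditions.items())])

# A missing key yields NULL (default, non-ANSI semantics), so coalesce adds the fallback.
df.withColumn(
    "my_column",
    F.coalesce(mapping[F.col("_1")], F.lit(my_default)),
).show()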