apache-sparkpysparkapache-spark-sqlconfigparser

Is there a way to use a map/dict in Pyspark to avoid CASE WHEN condition equals pairs?


I have a problem in Pyspark creating a column based on values in another column for a new dataframe. It's boring and seems to me not a good practice to use a lot of

CASE
  WHEN column_a = 'value_1' THEN 'value_x'
  WHEN column_a = 'value_2' THEN 'value_y'
  ...  
  WHEN column_a = 'value_289' THEN 'value_xwerwz'
END

In cases like this, in python, I get used to using a dict or, even better, a configparser file and avoid the if else condition. I just pass the key and python returns the desired value. Also, we have a 'fallback' option for ELSE clause.

The problem seems to me that we are not treating a single row but all of them in one command, so using dict/map/configparser is an unavailable option. I thought about using a loop with dict, but it seems too slow and a waste of computation as we repeat all the conditions.

I'm still looking for this practice, if I find it, I'll post it here. But, you know, probably a lot of people already use it and I don't know yet. But if there is no other way, ok. Use many WHEN THEN conditions won't be a choice.

Thank you

I tried to use a dict and searched for solutions like this


Solution

  • You could create a function which converts a dict into a Spark F.when, e.g.:

    import pyspark.sql.functions as F
    
    def create_spark_when(column, conditions, default):
        when = None
        for key, value in conditions.items():
            current_when = F.when(F.col(column) == key, value)
            if when is None:
                when = current_when.otherwise(default)
            else:
                when = current_when.otherwise(when)
        return when
    
    df = spark.createDataFrame([(0,), (1,), (2,)])
    df.show()
    my_conditions = {1: "a", 2: "b"}
    my_default = "c"
    df.withColumn(
        "my_column",
        create_spark_when("_1", my_conditions, my_default),
    ).show()
    

    Output:

    +---+
    | _1|
    +---+
    |  0|
    |  1|
    |  2|
    +---+
    
    +---+---------+
    | _1|my_column|
    +---+---------+
    |  0|        c|
    |  1|        a|
    |  2|        b|
    +---+---------+