pythondataframepysparkleft-joindatabricks

How to add multiple LOOKUPVALUE columns in the table with filter on the another table efficiently (in Python)?


I need advices, because I don't have much experience with Python.

I'm thinking a better way (for performance, if it exists) to add multiple LOOKUPVALUE columns in the same table in Python. So, I have the following function to reuse multiple times (8 times) to add LOOKUPVALUE into the table with filtering.

def get_column_value_in_question_by_pk(df, df1, filter, result_column_name, alt_value):

    """Filter column Question by parameter "filter" and rename the lookupvalue column to result_column"""
    df1 = df1.filter(F.col("column_filter") == filter).withColumnRenamed(
        "lookupvalue_column", result_column_name
    )

    df = df.join(df1, df1["pk"] == df["pk"], "left").select(df["*"], df1[result_column_name])

    df = df.withColumn(
        result_column_name,
        F.when(F.col(result_column_name).isNull(), alt_value).otherwise(F.col(result_column_name)),
    )
    return df

It's fine to use it now, but when I reuse it 8 times to create 1 table (because I need those extra columns), then I don't think it's efficient.

Is there better way to do it?

 **Input data:**

    | pk       | column_filter  | lookupvalue_column |
    | -------- | -------------- | --------------     |
    | 123acb   | Location       | City1              |
    | 456bca   | Location       | City2              |
    | 123acb   | Question1      | Unhappy            |
    | 456bca   | Question1      | Disappointed       |
    | 123acb   | Question2      | Happy              |
    | 456bca   | Question2      | Very happy         |

**Expected output:**

    | pk       | Name          | result_column1(filter on Question1)    | result_column2(filter on Question2)     | result_column_name3(on Location)|
    | -------- | ------------- | ---------------------------            | -----------------------                 | --------------------------|
    | 123acb   | Name1         | Unhappy                                | Happy                                   | City1                     |
    | 456bca   | Name2         | Disappointed                           | Very happy                              | City2                     |

                

I call the function like this to add new column into the df dataframe:

get_column_value_in_question_by_pk(df, df1, "Location", "result_column_name3", "None")

Solution

  • You can simply achieve this by pivotting the DataFrame. In case if there is the same question per pk, the first occurance of lookupvalue_column per pk is used.

    from pyspark.sql import functions as f
    
    data1 = [
        ("123acb", "Location", "City1",),
        ("456bca", "Location", "City2"),
        ("123acb", "Question1", "Unhappy"),
        ("456bca", "Question1", "Disappointed"),
        ("123acb", "Question2", "Happy"),
        ("456bca", "Question2", "Very Happy"),
        ("456bca", "Question3", None),
    ]
    
    columns1 = ["pk", "column_filter", "lookupvalue_column"]
    df1 = spark.createDataFrame(data1, columns1)
    
    data2 = [
        ("123acb", "Name1"),
        ("456bca", "Name2"),
    ]
    
    columns2 = ["pk", "Name"]
    df2 = spark.createDataFrame(data2, columns2)
    
    df_pivot = (
        df1.withColumn("pivot_col", f.concat(f.lit("result_"), f.col("column_filter")))
        .groupBy("pk")
        .pivot("pivot_col")
        .agg(f.first(f.col("lookupvalue_column")))
    )
    
    df_result = (
        df2
        .join(df_pivot, ["pk"], "left")
        .fillna("default_val", subset=["result_Question3"])
    )
    
    

    Result df_pivot:

    +------+---------------+----------------+----------------+----------------+
    |    pk|result_Location|result_Question1|result_Question2|result_Question3|
    +------+---------------+----------------+----------------+----------------+
    |123acb|          City1|         Unhappy|           Happy|            NULL|
    |456bca|          City2|    Disappointed|      Very Happy|            NULL|
    +------+---------------+----------------+----------------+----------------+
    

    Result df_result:

    +------+-----+---------------+----------------+----------------+----------------+
    |    pk| Name|result_Location|result_Question1|result_Question2|result_Question3|
    +------+-----+---------------+----------------+----------------+----------------+
    |123acb|Name1|          City1|         Unhappy|           Happy|     default_val|
    |456bca|Name2|          City2|    Disappointed|      Very Happy|     default_val|
    +------+-----+---------------+----------------+----------------+----------------+