python dataframe pyspark left-join databricks

How to add multiple LOOKUPVALUE columns in the table with filter on the another table efficiently (in Python)?

I need advices, because I don't have much experience with Python.

I'm thinking a better way (for performance, if it exists) to add multiple LOOKUPVALUE columns in the same table in Python. So, I have the following function to reuse multiple times (8 times) to add LOOKUPVALUE into the table with filtering.

def get_column_value_in_question_by_pk(df, df1, filter, result_column_name, alt_value):

    """Filter column Question by parameter "filter" and rename the lookupvalue column to result_column"""
    df1 = df1.filter(F.col("column_filter") == filter).withColumnRenamed(
        "lookupvalue_column", result_column_name
    )

    df = df.join(df1, df1["pk"] == df["pk"], "left").select(df["*"], df1[result_column_name])

    df = df.withColumn(
        result_column_name,
        F.when(F.col(result_column_name).isNull(), alt_value).otherwise(F.col(result_column_name)),
    )
    return df

It's fine to use it now, but when I reuse it 8 times to create 1 table (because I need those extra columns), then I don't think it's efficient.

Is there better way to do it?

 **Input data:**

    | pk       | column_filter  | lookupvalue_column |
    | -------- | -------------- | --------------     |
    | 123acb   | Location       | City1              |
    | 456bca   | Location       | City2              |
    | 123acb   | Question1      | Unhappy            |
    | 456bca   | Question1      | Disappointed       |
    | 123acb   | Question2      | Happy              |
    | 456bca   | Question2      | Very happy         |

**Expected output:**

    | pk       | Name          | result_column1(filter on Question1)    | result_column2(filter on Question2)     | result_column_name3(on Location)|
    | -------- | ------------- | ---------------------------            | -----------------------                 | --------------------------|
    | 123acb   | Name1         | Unhappy                                | Happy                                   | City1                     |
    | 456bca   | Name2         | Disappointed                           | Very happy                              | City2                     |

I call the function like this to add new column into the df dataframe:

get_column_value_in_question_by_pk(df, df1, "Location", "result_column_name3", "None")

Solution

You can simply achieve this by pivotting the DataFrame. In case if there is the same question per pk, the first occurance of lookupvalue_column per pk is used.

from pyspark.sql import functions as f

data1 = [
    ("123acb", "Location", "City1",),
    ("456bca", "Location", "City2"),
    ("123acb", "Question1", "Unhappy"),
    ("456bca", "Question1", "Disappointed"),
    ("123acb", "Question2", "Happy"),
    ("456bca", "Question2", "Very Happy"),
    ("456bca", "Question3", None),
]

columns1 = ["pk", "column_filter", "lookupvalue_column"]
df1 = spark.createDataFrame(data1, columns1)

data2 = [
    ("123acb", "Name1"),
    ("456bca", "Name2"),
]

columns2 = ["pk", "Name"]
df2 = spark.createDataFrame(data2, columns2)

df_pivot = (
    df1.withColumn("pivot_col", f.concat(f.lit("result_"), f.col("column_filter")))
    .groupBy("pk")
    .pivot("pivot_col")
    .agg(f.first(f.col("lookupvalue_column")))
)

df_result = (
    df2
    .join(df_pivot, ["pk"], "left")
    .fillna("default_val", subset=["result_Question3"])
)

Result df_pivot:

+------+---------------+----------------+----------------+----------------+
|    pk|result_Location|result_Question1|result_Question2|result_Question3|
+------+---------------+----------------+----------------+----------------+
|123acb|          City1|         Unhappy|           Happy|            NULL|
|456bca|          City2|    Disappointed|      Very Happy|            NULL|
+------+---------------+----------------+----------------+----------------+

Result df_result:

+------+-----+---------------+----------------+----------------+----------------+
|    pk| Name|result_Location|result_Question1|result_Question2|result_Question3|
+------+-----+---------------+----------------+----------------+----------------+
|123acb|Name1|          City1|         Unhappy|           Happy|     default_val|
|456bca|Name2|          City2|    Disappointed|      Very Happy|     default_val|
+------+-----+---------------+----------------+----------------+----------------+