I need advice, because I don't have much experience with Python.

I'm looking for a better way (for performance, if one exists) to add multiple LOOKUPVALUE columns to the same table in PySpark. I have the following function, which I reuse multiple times (8 times) to add a LOOKUPVALUE column to the table with filtering:
```python
from pyspark.sql import functions as F


def get_column_value_in_question_by_pk(df, df1, filter_value, result_column_name, alt_value):
    """Filter df1 on column_filter == filter_value, rename the lookup column to
    result_column_name, and left-join the result onto df by pk."""
    # Keep only the rows matching the requested filter value,
    # and rename the lookup column to the requested output name.
    df1 = df1.filter(F.col("column_filter") == filter_value).withColumnRenamed(
        "lookupvalue_column", result_column_name
    )
    # Left-join on pk, keeping all original columns plus the new one.
    df = df.join(df1, df1["pk"] == df["pk"], "left").select(df["*"], df1[result_column_name])
    # Fall back to alt_value where no match was found.
    df = df.withColumn(
        result_column_name,
        F.when(F.col(result_column_name).isNull(), alt_value).otherwise(F.col(result_column_name)),
    )
    return df
```
It works fine, but when I reuse it 8 times to build one table (because I need those extra columns), each call adds another join, and I don't think that's efficient. Is there a better way to do it?
**Input data:**
| pk | column_filter | lookupvalue_column |
| -------- | -------------- | -------------- |
| 123acb | Location | City1 |
| 456bca | Location | City2 |
| 123acb | Question1 | Unhappy |
| 456bca | Question1 | Disappointed |
| 123acb | Question2 | Happy |
| 456bca | Question2 | Very happy |
**Expected output:**
| pk | Name | result_column1 (filter on Question1) | result_column2 (filter on Question2) | result_column_name3 (filter on Location) |
| -------- | ------------- | --------------------------- | ----------------------- | -------------------------- |
| 123acb | Name1 | Unhappy | Happy | City1 |
| 456bca | Name2 | Disappointed | Very happy | City2 |
I call the function like this to add a new column to the `df` DataFrame:

```python
df = get_column_value_in_question_by_pk(df, df1, "Location", "result_column_name3", "None")
```
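
For context, this is roughly how the eight calls stack up today; the filter/column/default triples below are placeholders, not my real names:

```python
from functools import reduce

# Placeholder (filter, result column, default) triples; the real job has eight.
lookups = [
    ("Question1", "result_column1", "None"),
    ("Question2", "result_column2", "None"),
    ("Location", "result_column_name3", "None"),
]

# Each call feeds the previous result back in, so this stacks one join per lookup.
df = reduce(lambda acc, args: get_column_value_in_question_by_pk(acc, df1, *args), lookups, df)
```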
You can achieve this with a single pivot of the DataFrame. If the same question occurs more than once per `pk`, the first occurrence of `lookupvalue_column` per `pk` is used.
```python
from pyspark.sql import functions as f

# Assumes an existing SparkSession named `spark` (as in a notebook or spark-shell).
data1 = [
    ("123acb", "Location", "City1"),
    ("456bca", "Location", "City2"),
    ("123acb", "Question1", "Unhappy"),
    ("456bca", "Question1", "Disappointed"),
    ("123acb", "Question2", "Happy"),
    ("456bca", "Question2", "Very Happy"),
    ("456bca", "Question3", None),
]
columns1 = ["pk", "column_filter", "lookupvalue_column"]
df1 = spark.createDataFrame(data1, columns1)

data2 = [
    ("123acb", "Name1"),
    ("456bca", "Name2"),
]
columns2 = ["pk", "Name"]
df2 = spark.createDataFrame(data2, columns2)
```
```python
# Build one target column name per filter value, then pivot so each
# distinct column_filter value becomes its own column.
df_pivot = (
    df1.withColumn("pivot_col", f.concat(f.lit("result_"), f.col("column_filter")))
    .groupBy("pk")
    .pivot("pivot_col")
    .agg(f.first(f.col("lookupvalue_column")))
)
```
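
A performance note: when the set of `column_filter` values is known up front (here it is, since you list the eight filters yourself), you can pass the expected pivot values to `pivot()` explicitly, which spares Spark an extra pass over the data to discover the distinct values. A sketch with the four values from the sample data:

```python
# Listing the pivot values up front avoids an extra distinct() scan of df1.
pivot_values = ["result_Location", "result_Question1", "result_Question2", "result_Question3"]

df_pivot = (
    df1.withColumn("pivot_col", f.concat(f.lit("result_"), f.col("column_filter")))
    .groupBy("pk")
    .pivot("pivot_col", pivot_values)
    .agg(f.first(f.col("lookupvalue_column")))
)
```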
```python
# Join the pivoted lookups onto the base table and fill the one
# column that can be NULL in the sample data.
df_result = (
    df2
    .join(df_pivot, ["pk"], "left")
    .fillna("default_val", subset=["result_Question3"])
)
```
Result `df_pivot`:
```
+------+---------------+----------------+----------------+----------------+
|    pk|result_Location|result_Question1|result_Question2|result_Question3|
+------+---------------+----------------+----------------+----------------+
|123acb|          City1|         Unhappy|           Happy|            NULL|
|456bca|          City2|    Disappointed|      Very Happy|            NULL|
+------+---------------+----------------+----------------+----------------+
```
Result `df_result`:
```
+------+-----+---------------+----------------+----------------+----------------+
|    pk| Name|result_Location|result_Question1|result_Question2|result_Question3|
+------+-----+---------------+----------------+----------------+----------------+
|123acb|Name1|          City1|         Unhappy|           Happy|     default_val|
|456bca|Name2|          City2|    Disappointed|      Very Happy|     default_val|
+------+-----+---------------+----------------+----------------+----------------+
```
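
If each looked-up column needs its own fallback (which is what the `alt_value` parameter in your function provides), `fillna` also accepts a dict mapping column names to values, so all defaults can be applied in one call. A sketch with hypothetical per-column defaults:

```python
# Hypothetical per-column fallbacks; substitute the real alt_value for each lookup.
defaults = {
    "result_Location": "None",
    "result_Question1": "None",
    "result_Question2": "None",
    "result_Question3": "default_val",
}

df_result = df2.join(df_pivot, ["pk"], "left").fillna(defaults)
```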