I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below is sample data and my present code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
Col1 Col2 Col3 Name
1 40 56 john jones
2 45 55 tracey smith
3 33 23 amy sanders
Expected Output
Col1 Col2 Col3 Name
0.5 1.02 1.25 john jones
1 1.14 1.23 tracey smith
1.5 0.84 0.51 amy sanders
My function as of now (not working):
#function to divide few columns by the column average and overwrite the column
def avg_scaling(df):
    #List of columns which have to be scaled by their average
    col_list = ['col1', 'col2', 'col3']
    for i in col_list:
        df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))
    return df
new_df = avg_scaling(df)
You can make use of a Window here, partitioned on a pseudo column, and compute the average over that window. Your current version fails because df.select(f.avg(df[i])) returns a DataFrame rather than a scalar, so it cannot be used inside a column expression (and withcolumn should be withColumn).
The code goes like this,
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 1| 40| 56| john jones|
| 2| 45| 55|tracey smith|
| 3| 33| 23| amy sanders|
+----+----+----+------------+
from pyspark.sql import Window
from pyspark.sql import functions as F

def avg_scaling(df, cols_to_scale):
    # Window over a constant pseudo column, so the average spans all rows
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(col, F.round(F.col(col) / F.avg(col).over(w), 2))
    return df

new_df = avg_scaling(df, ["Col1", "Col2", "Col3"])
new_df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 0.5|1.02|1.25| john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+
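If you'd rather avoid the window (partitioning on a literal pulls every row into a single partition), a rough alternative sketch is to compute all the averages once with a single agg and divide by them as literals. The helper name avg_scaling_no_window below is just illustrative:

from pyspark.sql import functions as F

def avg_scaling_no_window(df, cols_to_scale):
    # Collect each column's average to the driver in one aggregation (a single Row),
    # then divide every column by its own average as a literal value.
    avgs = df.agg(*[F.avg(c).alias(c) for c in cols_to_scale]).first()
    for c in cols_to_scale:
        df = df.withColumn(c, F.round(F.col(c) / F.lit(avgs[c]), 2))
    return df

new_df = avg_scaling_no_window(df, ["Col1", "Col2", "Col3"])
new_df.show()

Both routes give the same scaled values; the agg version just avoids shuffling the whole DataFrame into one window partition.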