pysparkpandas-udf

Dividing a set of columns by its average in Pyspark


I have to divide a set of columns in a pyspark.sql.dataframe by their respective column average but I am not able to find an correct way to do it. Below is a sample data and my present code.

Input Data

columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
    ("2", "45", "55", "tracey smith"),
    ("3", "33", "23", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

Col1    Col2    Col3    Name
 1      40      56  john jones
 2      45      55  tracey smith
 3      33      23  amy sanders

Expected Output

Col1    Col2    Col3    Name
 0.5    1.02    1.25    john jones
 1      1.14    1.23    tracey smith
 1.5    0.84    0.51    amy sanders

Function as of now. Not working:

#function to divide few columns by the column average and overwrite the column

def avg_scaling(df):

  #List of columns which have to be scaled by their average
  col_list = ['col1', 'col2', 'col3']

  for i in col_list:
   df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))

   return df

new_df = avg_scaling(df)

Solution

  • You can make use of a Window here partitioned on a pusedo column and run average on that window.

    The code goes like this,

    columns = ["Col1", "Col2", "Col3","Name"]
    data = [("1","40", "56" , "john jones"),
        ("2", "45", "55", "tracey smith"),
        ("3", "33", "23", "amy sanders")]
    
    df = spark.createDataFrame(data=data,schema=columns)
    
    df.show()
    
    +----+----+----+------------+
    |Col1|Col2|Col3|        Name|
    +----+----+----+------------+
    |   1|  40|  56|  john jones|
    |   2|  45|  55|tracey smith|
    |   3|  33|  23| amy sanders|
    +----+----+----+------------+
    
    
    from pyspark.sql import Window
    
    def avg_scaling(df, cols_to_scale):
        
        w = Window.partitionBy(F.lit(1))
        for col in cols_to_scale:
            df = df.withColumn(f"{col}", F.round(F.col(col) / F.avg(col).over(w), 2))
        return df
    
    new_df = avg_scaling(df, ["Col1", 'Col2', 'Col3'])
    new_df.show()
    
    
    +----+----+----+------------+
    |Col1|Col2|Col3|        Name|
    +----+----+----+------------+
    | 0.5|1.02|1.25|  john jones|
    | 1.5|0.84|0.51| amy sanders|
    | 1.0|1.14|1.23|tracey smith|
    +----+----+----+------------+