Tags: python, apache-spark, pyspark

PySpark GroupedData - chain several different aggregation methods


I am playing with GroupedData in PySpark.

This is my environment:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.24
Branch HEAD

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.html

I wonder if the following is possible.

Say I want to use only the methods of GroupedData, and not import any functions from pyspark.sql.functions.

OK, suppose I have some DataFrame, I've already grouped it by column A, and I've got a GroupedData object back.

Now, on my GroupedData object, I want to compute, say, sum(column B), avg(column C), and maybe min(column D), either in one shot or via chained method calls.

Can I do this just by using GroupedData methods?

I am asking this because it seems that once I've done sum(column B), I don't have a GroupedData object anymore, so I cannot chain any further GroupedData methods.
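For instance, here is roughly what I see (df is a hypothetical DataFrame with columns A, B, C, and D):

    grouped = df.groupBy("A")   # a GroupedData object
    summed = grouped.sum("B")   # a DataFrame again, not GroupedData

    # so chaining fails with
    # AttributeError: 'DataFrame' object has no attribute 'avg'
    # grouped.sum("B").avg("C").min("D")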

So is what I have in mind possible or not?
If it is, how can we do it?


Solution

  • I do not think that this is possible.

    Looking at the source of GroupedData, we see that all the aggregation methods like sum, avg, and min

    return a DataFrame, not another GroupedData object, so chaining is not possible.
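
    That said, a single call can still produce several different aggregations. As a minimal sketch (the toy data and column names are just the ones from the question), GroupedData.agg() also accepts a plain dict mapping column names to aggregate function names, so nothing needs to be imported from pyspark.sql.functions:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Toy DataFrame with the column names used in the question.
        df = spark.createDataFrame(
            [("x", 1, 2.0, 3), ("x", 4, 5.0, 6), ("y", 7, 8.0, 9)],
            ["A", "B", "C", "D"],
        )

        # One agg() call, several different aggregations, no chaining needed.
        df.groupBy("A").agg({"B": "sum", "C": "avg", "D": "min"}).show()

    Note that this is still a single GroupedData method call rather than chaining, so it does not contradict the point above: each aggregation method hands back a DataFrame, and the multiple aggregations have to happen inside that one call.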