I am playing with GroupedData in PySpark. This is my environment:
Welcome to Spark version 3.5.1
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.24
Branch HEAD
I wonder if the following is possible. Say I want to use only the methods of GroupedData, and not import any functions from pyspark.sql.functions.
OK, suppose I have some DataFrame that I've already grouped by column A, so I've got a GroupedData object back. Now, on that GroupedData object, I want to do, say, sum(column B), avg(column C), and maybe min(column D), in one shot or via chained method calls. Can I do this just by using GroupedData methods?
I am asking because it seems that once I've done sum(column B), I no longer have a GroupedData object, and so I cannot chain any further GroupedData methods. So is what I have in mind possible or not? If it is possible, how can I do it?
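To make the intent concrete, here is a minimal sketch of the setup. The toy data and the spark session are assumptions made for illustration; the column names A, B, C, D match the ones above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with the columns mentioned in the question.
df = spark.createDataFrame(
    [("x", 1, 10.0, 100), ("x", 2, 20.0, 200), ("y", 3, 30.0, 300)],
    ["A", "B", "C", "D"],
)

grouped = df.groupBy("A")   # this is a GroupedData object

# What I would like, using only GroupedData methods, in one shot or chained:
# sum of B, avg of C, min of D per group.
result = grouped.sum("B")   # this works on its own, but see the problem described above
```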
I do not think that this is possible. Looking at the source of GroupedData, we see that all aggregation methods like sum, avg, and min return a DataFrame, so chaining is not possible.
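A quick check in the shell, continuing the hypothetical example sketched above, shows why: the first aggregation already hands back a DataFrame, which has no GroupedData methods left to chain onto:

```python
summed = grouped.sum("B")

print(type(grouped))  # typically <class 'pyspark.sql.group.GroupedData'>
print(type(summed))   # typically <class 'pyspark.sql.dataframe.DataFrame'>

# Chaining another GroupedData method no longer works at this point:
# summed.avg("C")  # raises AttributeError: the DataFrame has no avg() method
```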