I have a SparkR DataFrame
and I want to get the mode (most often) value
for each unique name
. How can I do this? There doesn't seem to be a built-in mode
function. Either a SparkR or PySpark solution will do.
# Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
value = c(5, 5, 4, 3, 3, 7))
DF <- createDataFrame(df)
name | value
-----------------
Thomas | 5
Thomas | 5
Thomas | 4
Bill | 3
Bill | 3
Bill | 9
# What I want to get
name | mode(value)
-----------------
Thomas | 5
Bill | 3
You could achieve that using combination of .groupBy()
and window
methods like this:
grouped = df.groupBy('name', 'value').count()
window = Window.partitionBy("name").orderBy(desc("count"))
grouped\
.withColumn('order', row_number().over(window))\
.where(col('order') == 1)\
.show()
outputs:
+------+-----+-----+-----+
| name|value|count|order|
+------+-----+-----+-----+
| Bill| 3| 2| 1|
|Thomas| 5| 2| 1|
+------+-----+-----+-----+