On Databricks, start with an R data.frame:
x <- data.frame(n=1:1000)
The summary function provides nice output for the R data.frame (SparkR::summary falls back to base R's summary here, since x is a plain data.frame):
SparkR::summary(x)
n
Min. : 1.0
1st Qu.: 250.8
Median : 500.5
Mean : 500.5
3rd Qu.: 750.2
Max. :1000.0
Next, I will convert the R data.frame to a Spark DataFrame:
y <- SparkR::createDataFrame(x=x)
I can confirm that object y is indeed a SparkDataFrame:
class(y)
[1] "SparkDataFrame" attr(,"package") [1] "SparkR"
Unfortunately, SparkR doesn't print the summary statistics when I call the function on the Spark DataFrame; it only shows the schema of the result:
SparkR::summary(y)
SparkDataFrame[summary:string, n:string]
I figured out the answer while I was writing the question, so I might as well record it here:
The SparkR summary function returns a SparkDataFrame, not text, so its contents must be materialized before they can be viewed. Two ways to do that:
display(SparkR::summary(y))
or
SparkR::collect(SparkR::summary(y))
The display function prints SparkDataFrames as nicely formatted output in a Databricks notebook.
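If plain console output is enough, I believe SparkR::showDF also works (it is part of the SparkR API, though not something I used above); it prints the rows of a SparkDataFrame directly:
SparkR::showDF(SparkR::summary(y))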
The SparkR collect function pulls a Spark DataFrame into a local object in RAM on the driver of the active cluster. This operation is trivial for the tiny dataframe containing the statistical summary.
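For example (a minimal sketch; summary_local is just an illustrative name of my own):
summary_local <- SparkR::collect(SparkR::summary(y))  # now an ordinary R data.frame on the driver
class(summary_local)  # [1] "data.frame"
summary_local         # prints the summary statistics as regular R output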