Tags: r, databricks, sparkr

Why does SparkR (on Databricks) not display summary function output when working with Spark DataFrames?


On Databricks, start with an R data.frame:

x <- data.frame(n=1:1000)

Now the SparkR summary function provides nice output:

SparkR::summary(x)

       n         
 Min.   :   1.0  
 1st Qu.: 250.8  
 Median : 500.5  
 Mean   : 500.5  
 3rd Qu.: 750.2  
 Max.   :1000.0  

Command took 0.02 seconds -- by @ at 9/9/2020, 9:46:57 AM on aa_cluster_6w

Next, I convert the R data.frame to a Spark DataFrame:

y <- SparkR::createDataFrame(x=x)

I can confirm that object y is indeed a SparkDataFrame:

class(y)

[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"


Unfortunately, when I call summary on the Spark DataFrame, SparkR prints only the schema of the result rather than the summary statistics:

SparkR::summary(y)

SparkDataFrame[summary:string, n:string]



Solution

  • I figured out the answer while writing the question, so I might as well record it here:

    The SparkR summary function returns a SparkDataFrame, not printed text, so the result must be explicitly displayed or collected. Two ways to do it:

    display(SparkR::summary(y))
    

    or

    SparkR::collect(SparkR::summary(y))
    

    The display function renders Spark DataFrames as nicely formatted output in a Databricks notebook.

    The SparkR collect function pulls a Spark DataFrame into a local R data.frame in RAM on the driver of the active cluster. This operation is trivial for the tiny dataframe containing the statistical summary.
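    As an aside (my own addition, not part of the original notebook), SparkR also provides showDF, which prints the first rows of a SparkDataFrame to the console as plain text without collecting the data onto the driver. A minimal sketch, assuming a running SparkR session and the SparkDataFrame y from above:

    ```r
    # Print the summary SparkDataFrame directly as text in the console;
    # showDF() renders rows without pulling the data into local memory.
    SparkR::showDF(SparkR::summary(y))

    # Alternatively, collect() returns an ordinary R data.frame,
    # which then prints with the usual R print method:
    local_summary <- SparkR::collect(SparkR::summary(y))
    print(local_summary)
    ```

    showDF is handy in plain console sessions where the Databricks display function is unavailable; collect is the right choice when you want to keep working with the statistics as a local R object.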