OS X El Capitan 10.11.6
Spark 2.2.0 (local)
Scala 2.11.8
Apache Toree Jupyter Kernel 0.2.0
Per the instructions I received from this post, I've successfully added a Spark - Scala
kernel to my Jupyter notebook using the Toree installer. However, I've noticed that the kernel's Scala syntax support is very limited. Here are two examples:
1. Not able to manually create a DataFrame
The following code works in a terminal Spark shell:
val test = Seq(
  ("Brandon", "Erica"),
  ("Allen", "Sarabeth"),
  ("Jared", "Kyler")
).toDF("guy", "girl")
But when I try to run it in Jupyter with the Spark - Scala
kernel, I receive the following error:
Name: Compile Error
Message: <console>:21: error: value toDF is not a member of Seq[(String, String)]
possible cause: maybe a semicolon is missing before `value toDF'?
toDF("guy", "girl")
^
2. Not able to call column names with certain syntax
It seems as though the Jupyter Spark - Scala kernel does not recognize columns referenced with $"columnName", but does recognize columns referenced with df.col("columnName"). The $"columnName" syntax throws the following error:
Name: Compile Error
Message: <console>:31: error: value $ is not a member of StringContext
df.where($"columnName" =!= "NA").
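For context, the two column syntaxes are meant to be interchangeable in Spark SQL; the $"..." form is sugar provided by the session's implicit conversions, which is why it breaks when those implicits aren't in scope. A minimal sketch, assuming a DataFrame df with a column named "columnName":

// df.col(...) is a plain method call and needs no implicits.
df.where(df.col("columnName") =!= "NA")

// $"..." relies on an implicit StringContext conversion, so it only
// compiles after something like `import sqlC.implicits._` is in scope.
df.where($"columnName" =!= "NA")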
I suspect there is a high-level solution that will allow all Spark Scala syntax to be used in Jupyter, and I look forward to the community's support.
I found an answer to another post that also resolved my issues:
// Build an SQLContext from the existing SparkContext (sc), then bring its
// implicit conversions (toDF on Seq, the $"..." column syntax) into scope.
val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC.implicits._
Running this at the beginning of the notebook has alleviated all syntax limitations I was previously having.
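A note on the fix: in Spark 2.x the same implicits also hang off the SparkSession, so if your kernel already binds a session to a variable (conventionally spark, as the Spark shell does; whether Toree exposes it under that name is an assumption here), you can import them directly without constructing a new SQLContext. A minimal sketch:

// Assumes `spark` is a pre-bound org.apache.spark.sql.SparkSession.
import spark.implicits._

// Both previously failing constructs now compile:
val test = Seq(
  ("Brandon", "Erica"),
  ("Allen", "Sarabeth"),
  ("Jared", "Kyler")
).toDF("guy", "girl")

test.where($"guy" =!= "Jared").show()

This avoids creating a second SQLContext alongside the one the session already owns.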