Tags: scala, apache-spark, jupyter-notebook, apache-toree

Limited Scala Syntax with Apache Toree Kernel in Jupyter


OS X El Capitan 10.11.6
Spark 2.2.0 (local)
Scala 2.11.8
Apache Toree Jupyter Kernel 0.2.0

Per the instructions I received from this post, I've successfully added a Spark - Scala kernel to my Jupyter notebook by using this Toree installer. However, I have noticed that the Scala syntax available in the notebook is very limited. Here are two examples:

1. Not able to manually create a DataFrame

The following code works in a terminal Spark shell:

val test = Seq(
        ("Brandon", "Erica"),
        ("Allen", "Sarabeth"),
        ("Jared", "Kyler")).
    toDF("guy", "girl")

But when trying to run in Jupyter with a Spark - Scala kernel, I receive the following error:

Name: Compile Error
Message: <console>:21: error: value toDF is not a member of Seq[(String, String)]
possible cause: maybe a semicolon is missing before `value toDF'?
       toDF("guy", "girl")
       ^

2. Not able to call column names with certain syntax

It seems as though the Jupyter Spark - Scala kernel does not recognize columns when called with $"columnName", but does recognize columns called with df.col("columnName"). The $"columnName" syntax throws the following error:

Name: Compile Error
Message: <console>:31: error: value $ is not a member of StringContext
   df.where($"columnName" =!= "NA").

I'm thinking that there is a high-level solution that will allow all Spark Scala syntax to be used in Jupyter, and I look forward to the community's support.


Solution

  • I found an answer to another post that also resolved my issues:

    val sqlC = new org.apache.spark.sql.SQLContext(sc)
    import sqlC.implicits._ 
    

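    Running this at the beginning of the notebook resolved both errors. toDF and the $"columnName" interpolator are both supplied by the implicits import, which the terminal Spark shell performs automatically but this kernel apparently did not.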
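
    For Spark 2.x, the same fix can probably be written against SparkSession, since the SQLContext constructor is deprecated there. The following is a minimal sketch, not a tested Toree recipe: it assumes the cell runs inside a notebook where the kernel has already started a SparkContext, so that getOrCreate() picks up the existing session. The DataFrame and the "guy" / "girl" column names are taken from the question; the name filtered is only illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the session the kernel already started (assumption: one exists).
// It has to be bound to a val, because implicits can only be imported
// from a stable identifier.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// toDF on a Seq of tuples comes from the implicits import.
val test = Seq(
  ("Brandon", "Erica"),
  ("Allen", "Sarabeth"),
  ("Jared", "Kyler")
).toDF("guy", "girl")

// The $"..." column syntax comes from the same import.
val filtered = test.where($"guy" =!= "Allen")
filtered.show()
```

    As with the SQLContext version, this cell needs to run before any cell that uses toDF or $"columnName".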