java scala apache-spark

Scala vs Java for Spark?


Can someone help me understand why people use Scala over Java for Spark? I have been researching but haven't been able to find a solid answer. I know both work fine since they both run on the JVM, and I know Scala is both a functional and an OOP language.

Thanks


Solution

  • Spark was written in Scala. Spark also came out before Java 8 was available, which made functional programming in Java more cumbersome. Scala's concise syntax is also closer to Python while still running on the JVM. Data scientists were the original target users for Spark, and they traditionally have more of a background in Python, so Scala was an easier step for them than going straight to Java (a short sketch at the end of this answer shows the difference in practice).

    Here is a direct quote from one of Spark's original authors, taken from a Reddit AMA. The question was:

    Q:

    How important was it to create Spark in Scala? Would it have been feasible / realistic to write it in Java or was Scala fundamental to Spark?

    A from Matei Zaharia:

    At the time we started, I really wanted a PL that supports a language-integrated interface (where people write functions inline, etc), because I thought that was the way people would want to program these applications after seeing research systems that had it (specifically Microsoft's DryadLINQ). However, I also wanted to be on the JVM in order to easily interact with the Hadoop filesystem and data formats for that. Scala was the only somewhat popular JVM language then that offered this kind of functional syntax and was also statically typed (letting us have some control over performance), so we chose that. Today there might be an argument to make the first version of the API in Java with Java 8, but we also benefitted from other aspects of Scala in Spark, like type inference, pattern matching, actor libraries, etc.

    Edit

    Here's the link in case folks are interested in more of what Matei had to say: https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/
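
    To make the "language-integrated interface" point concrete, here is a minimal, hypothetical sketch (not from the post) of everyday Spark code in Scala, showing the inline functions, type inference, and pattern matching Matei mentions. The object name and input path are placeholders I've made up for illustration.

        import org.apache.spark.sql.SparkSession

        object InlineStyleSketch {
          def main(args: Array[String]): Unit = {
            // Local session purely for demonstration.
            val spark = SparkSession.builder()
              .appName("inline-style-sketch")
              .master("local[*]")
              .getOrCreate()
            val sc = spark.sparkContext

            // "input.txt" is a placeholder path, not a file from the original post.
            val counts = sc.textFile("input.txt")
              .flatMap(line => line.split("\\s+")) // inline function; `line` inferred as String
              .map(word => (word, 1))              // type inference: RDD[(String, Int)]
              .reduceByKey(_ + _)                  // concise functional syntax

            // Pattern matching destructures each (word, count) pair directly.
            counts.take(10).foreach { case (word, count) =>
              println(s"$word: $count")
            }

            spark.stop()
          }
        }

    Before Java 8, each of those inline functions would have had to be written as an anonymous inner class implementing one of Spark's Java Function interfaces, which is the cumbersomeness the answer alludes to.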