scala, apache-spark, sbt, databricks, apache-spark-standalone

Spark Standalone: how to avoid sbt assembly and uber-jar?


I have a build.sbt like this for Spark programming:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1" withSources(),
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0" withSources()
  ...
)

As my program uses libraries other than Spark itself, I have to run sbt assembly to generate an uber-jar, which I can then pass as an argument to spark-submit in order to run the application on my Spark standalone cluster.
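For reference, my current workflow looks roughly like this (the class name, master URL, and jar name below are just placeholders):

sbt assembly
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  target/scala-2.12/myapp-assembly-0.1.jar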

The resulting uber-jar works like a charm.

However, the assembly step takes a lot of time, and I find this method too slow for iterating during development.

I mean, for every change to the Spark application code that I want to test, I have to run another sbt build to produce the uber-jar, and each time it takes very long (at least 5 minutes) before I can run it on my cluster.

I know I could tune build.sbt a bit to speed up the compilation, but I think it would still remain slow.

So my question is: are there other methods that avoid building an uber-jar altogether?

Ideally, I'm thinking of a method where I only have to run sbt package (much faster than sbt assembly), and can then tell spark-submit, or the Spark standalone cluster itself, which additional jars to load.

However, the spark-submit documentation seems clear about that:

application-jar: Path to a bundled jar including your application and all dependencies

... so maybe I have no other choice ...

Any pointers to speed up my Spark development with Scala, sbt, and additional libraries?


Solution

  • It's not necessary to put all dependent libraries into the assembly/fat jar - they simply need to be available to your application at runtime. This can be done in different ways:

    • via the --jars option of spark-submit, listing local jar files explicitly (this can get cumbersome with many dependencies);
    • via the --packages option of spark-submit, giving Maven coordinates so the dependencies (including transitive ones) are fetched from a Maven repository - see the sketch just below.

    See the Spark documentation on submitting applications for more details.
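    For example, a sketch of a spark-submit invocation that pulls the Cassandra connector from a Maven repository instead of bundling it (the class name, master URL, and jar path are placeholders; the coordinates assume Scala 2.12):

    spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
      target/scala-2.12/myapp_2.12-0.1.jar

    The jar here is the slim one produced by sbt package; with --jars you would instead pass a comma-separated list of local jar paths.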

    Also, the dependencies of Spark itself shouldn't be packed into the assembly - they should be marked as provided instead.
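    As a sketch of how the build.sbt from the question could be adjusted (only the provided scope is the point here; the versions are the ones from the question):

    libraryDependencies ++= Seq(
      // Spark is supplied by the cluster at runtime, so don't package it
      "org.apache.spark" %% "spark-core" % "3.0.1" % "provided" withSources(),
      // application-level dependencies keep the default compile scope
      "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0" withSources()
    )

    With Spark marked as provided, sbt package produces a small jar containing only your own classes, and the remaining dependencies can be supplied via --packages or --jars as described above.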

    P.S. If you run your code on Databricks, you can install libraries onto the cluster via the UI or the APIs, although you may still run into cases where you need to put your libraries into the assembly - this is sometimes required because of dependency conflicts.
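    As a rough sketch (the workspace host and cluster id are placeholders), installing a Maven library on an existing Databricks cluster via the Libraries API could look like this:

    curl -X POST https://<databricks-instance>/api/2.0/libraries/install \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "cluster_id": "1234-567890-abcde123",
        "libraries": [
          { "maven": { "coordinates": "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0" } }
        ]
      }'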