I have a build.sbt like this for Spark programming:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1" withSources(),
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0" withSources()
  ...
)
As my program uses libraries other than Spark itself, I have to use sbt assembly to generate an uber-jar that I can pass as an argument to spark-submit, in order to run the application on my Spark standalone cluster.
The resulting uber-jar works like a charm.
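For reference, my current workflow looks roughly like this (the class name, master URL, and jar path are just placeholders for my actual ones):

sbt assembly
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  target/scala-2.12/myapp-assembly-0.1.jar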
However, the build takes a lot of time, and I find this method too slow to iterate on during development. Every time I change the Spark application code and want to test it, I have to run another sbt build to produce the uber-jar, and each build takes a long time (at least 5 minutes) before I can run it on my cluster.
I know I could optimize the build.sbt a bit to speed up compilation, but I think it would remain slow.
So my question is: are there other methods that completely avoid building an uber-jar?
Ideally, I am thinking of a method where I would just run sbt package (a lot faster than sbt assembly), and then tell spark-submit, or the Spark standalone cluster, which additional jars to load.
However, the spark-submit documentation seems clear about that:
application-jar: Path to a bundled jar including your application and all dependencies
... so maybe I have no other choice.
Any pointers to speed up my Spark development with Scala, SBT, and additional libraries?
It's not necessary to put all dependent libraries into the assembly/fat jar; they simply need to be available to your application at runtime. This can be done in different ways:
--jars - this could be cumbersome, especially if the jars themselves have a lot of dependencies
--packages - in this case, you just provide the dependency(-ies) and Spark fetches all of them with all of their dependencies
See the Spark documentation for more details.
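For example, with --packages you can pass Maven coordinates and let Spark resolve the connector and its transitive dependencies at submit time (the class name, master URL, and jar path below are illustrative placeholders):

sbt package
spark-submit \
  --master spark://master-host:7077 \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
  --class com.example.MyApp \
  target/scala-2.12/myapp_2.12-0.1.jar

This way sbt package only has to build your own code, and the cluster nodes fetch the connector jars themselves.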
Also, dependencies of Spark itself shouldn't be packed into the assembly; they need to be marked as provided instead.
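For example, in the build.sbt from the question that would look roughly like this (a sketch, using the versions from the question):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0"
)

With spark-core marked as provided, it is available at compile time but excluded from the packaged jar, since the Spark distribution on the cluster already ships it.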
P.S. If you run your code on Databricks, you can install libraries into the cluster via the UI or APIs, although you may still run into cases where you need to put your libraries into an assembly; this is sometimes required because of dependency conflicts.