I want to build a mill job that lets me develop and run a Spark job locally, either via SparkSample.run or via a full fat jar for local tests.
At some point I'd like to ship it as a filtered assembly (i.e. without any Spark-related libs, but with all project libs) to a cluster with a running Spark context.
I currently use this build.sc
import mill._, scalalib._
import mill.modules.Assembly

object SparkSample extends ScalaModule {
  def scalaVersion = "2.12.10"
  def scalacOptions =
    Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")

  def ivySparkDeps = Agg(
    ivy"org.apache.spark::spark-sql:2.4.5"
      .exclude("org.slf4j" -> "slf4j-log4j12"),
    ivy"org.slf4j:slf4j-api:1.7.16",
    ivy"org.slf4j:slf4j-log4j12:1.7.16"
  )

  def ivyBaseDeps = Agg(
    ivy"com.lihaoyi::upickle:0.9.7"
  )

  // STANDALONE APP
  def ivyDeps = ivyBaseDeps ++ ivySparkDeps

  // REMOTE SPARK CLUSTER
  // def ivyDeps = ivyBaseDeps
  // def compileIvyDeps = ivySparkDeps
  // def assemblyRules =
  //   Assembly.defaultRules ++
  //     Seq(
  //       "scala/.*",
  //       "org.slf4j.*",
  //       "org.apache.log4j.*"
  //     ).map(Assembly.Rule.ExcludePattern.apply)
}
For running locally and building a full fat jar, I keep it as is.
For creating a filtered assembly, I comment out the ivyDeps line under "STANDALONE APP" and uncomment everything below "REMOTE SPARK CLUSTER".
Editing the build file for each new task felt inelegant to me, so I tried to add a separate task to build.sc:
def assembly2 = T {
  def ivyDeps = ivyBaseDeps
  def compileIvyDeps = ivySparkDeps
  def assemblyRules =
    Assembly.defaultRules ++
      Seq(
        "scala/.*",
        "org.slf4j.*",
        "org.apache.log4j.*"
      ).map(Assembly.Rule.ExcludePattern.apply)
  super.assembly
}
But when I run SparkSample.assembly2, I still get a full assembly rather than a filtered one. It seems that overriding ivyDeps et al. in a task does not work.
Is this possible in mill?
You can't override defs inside a task. Locally defining some ivyDeps and compileIvyDeps will not magically make super.assembly use them.
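To make that concrete, here is your assembly2 again, annotated with what actually happens (just an illustrative sketch, not a fix):

def assembly2 = T {
  // These are plain local definitions inside the task body. They shadow
  // nothing at the module level, and no other target ever reads them.
  def ivyDeps = ivyBaseDeps
  def compileIvyDeps = ivySparkDeps
  def assemblyRules = Assembly.defaultRules

  // super.assembly is still the module-level assembly target, wired to
  // SparkSample.ivyDeps and SparkSample.assemblyRules, so the result is
  // the unfiltered fat JAR.
  super.assembly
}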
Of course you could create such a task by looking at how super.assembly is defined in JavaModule, but you would end up copying and adapting a lot more targets (upstreamAssembly, upstreamAssemblyClasspath, transitiveLocalClasspath, and so on) and making your build file hard to read.
A better way would be to make the lighter dependencies and assembly rules the default and move the creation of the standalone JAR into a sub module.
import mill._, scalalib._
import mill.modules.Assembly

object SparkSample extends ScalaModule { outer =>
  def scalaVersion = "2.12.10"
  def scalacOptions =
    Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")

  def ivySparkDeps = Agg(
    ivy"org.apache.spark::spark-sql:2.4.5"
      .exclude("org.slf4j" -> "slf4j-log4j12"),
    ivy"org.slf4j:slf4j-api:1.7.16",
    ivy"org.slf4j:slf4j-log4j12:1.7.16"
  )

  def ivyDeps = Agg(
    ivy"com.lihaoyi::upickle:0.9.7"
  )

  // Spark is a compile-time-only dependency here, so SparkSample.assembly
  // produces the filtered (cluster) JAR by default.
  def compileIvyDeps = ivySparkDeps

  def assemblyRules =
    Assembly.defaultRules ++
      Seq(
        "scala/.*",
        "org.slf4j.*",
        "org.apache.log4j.*"
      ).map(Assembly.Rule.ExcludePattern.apply)

  // The standalone submodule adds Spark back as a regular dependency,
  // so SparkSample.standalone.assembly produces the full fat JAR.
  object standalone extends ScalaModule {
    def scalaVersion = outer.scalaVersion
    def moduleDeps = Seq(outer)
    def ivyDeps = outer.ivySparkDeps
  }
}
To create a Spark cluster JAR, run: mill SparkSample.assembly
To create a standalone JAR, run: mill SparkSample.standalone.assembly
To create both, simply run: mill __.assembly
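The filtered JAR from SparkSample.assembly is the one you would submit to the cluster (e.g. via spark-submit), since Spark itself is already provided there; the standalone JAR bundles Spark and is meant for local runs and tests.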