I want to set up a Mill build that lets me develop and run a Spark job locally, either via SparkSample.run or as a full fat JAR for local tests.
At some point I'd like to ship it as a filtered assembly (i.e. without all Spark-related libraries, but with all project libraries) to a cluster with a running Spark context.
I currently use this build.sc:
import mill._, scalalib._
import mill.modules.Assembly

object SparkSample extends ScalaModule {
  def scalaVersion = "2.12.10"
  def scalacOptions =
    Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")

  def ivySparkDeps = Agg(
    ivy"org.apache.spark::spark-sql:2.4.5"
      .exclude("org.slf4j" -> "slf4j-log4j12"),
    ivy"org.slf4j:slf4j-api:1.7.16",
    ivy"org.slf4j:slf4j-log4j12:1.7.16"
  )

  def ivyBaseDeps = Agg(
    ivy"com.lihaoyi::upickle:0.9.7"
  )

  // STANDALONE APP
  def ivyDeps = ivyBaseDeps ++ ivySparkDeps

  // REMOTE SPARK CLUSTER
  // def ivyDeps = ivyBaseDeps
  // def compileIvyDeps = ivySparkDeps
  // def assemblyRules =
  //   Assembly.defaultRules ++
  //     Seq(
  //       "scala/.*",
  //       "org.slf4j.*",
  //       "org.apache.log4j.*"
  //     ).map(Assembly.Rule.ExcludePattern.apply)
}
For running and for building a full fat JAR, I keep it as is.
For creating a filtered assembly, I comment out the ivyDeps line under "STANDALONE APP" and uncomment everything below "REMOTE SPARK CLUSTER".
Editing the build file for every new task felt inelegant, so I tried adding a separate task to build.sc:
def assembly2 = T {
  def ivyDeps = ivyBaseDeps
  def compileIvyDeps = ivySparkDeps
  def assemblyRules =
    Assembly.defaultRules ++
      Seq(
        "scala/.*",
        "org.slf4j.*",
        "org.apache.log4j.*"
      ).map(Assembly.Rule.ExcludePattern.apply)
  super.assembly
}
but when I run SparkSample.assembly2, I still get a full assembly rather than a filtered one. It seems that overriding ivyDeps et al. inside a task does not work.
Is this possible in Mill?
You can't override defs inside a task. Locally defining some ivyDeps and compileIvyDeps will not magically make super.assembly use them.
Of course, you could create such a task by looking at how super.assembly is defined in JavaModule, but you would end up copying and adapting many more targets (upstreamAssembly, upstreamAssemblyClasspath, transitiveLocalClasspath, and so on) and making your build file hard to read.
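To see why the local defs have no effect, here is a minimal plain-Scala sketch (hypothetical names, not the Mill API): a def declared inside a method body is only a local definition, so inherited code keeps calling the member defined on the enclosing type.

trait Base {
  def deps: Seq[String] = Seq("spark")   // stands in for ivyDeps
  def assemblyLike: Seq[String] = deps   // stands in for super.assembly
}

object Child extends Base {
  def assembly2: Seq[String] = {
    def deps = Seq.empty[String]         // local def, invisible to Base.assemblyLike
    assemblyLike                         // still returns Seq("spark")
  }
}

The same applies to Mill targets: only an override at the module level changes what super.assembly sees.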
A better way would be to make the lighter dependencies and assembly rules the default and move the creation of the standalone JAR into a submodule:
import mill._, scalalib._
import mill.modules.Assembly

object SparkSample extends ScalaModule { outer =>
  def scalaVersion = "2.12.10"
  def scalacOptions =
    Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")

  def ivySparkDeps = Agg(
    ivy"org.apache.spark::spark-sql:2.4.5"
      .exclude("org.slf4j" -> "slf4j-log4j12"),
    ivy"org.slf4j:slf4j-api:1.7.16",
    ivy"org.slf4j:slf4j-log4j12:1.7.16"
  )

  def ivyDeps = Agg(
    ivy"com.lihaoyi::upickle:0.9.7"
  )

  def compileIvyDeps = ivySparkDeps

  def assemblyRules =
    Assembly.defaultRules ++
      Seq(
        "scala/.*",
        "org.slf4j.*",
        "org.apache.log4j.*"
      ).map(Assembly.Rule.ExcludePattern.apply)

  object standalone extends ScalaModule {
    def scalaVersion = outer.scalaVersion
    def moduleDeps = Seq(outer)
    def ivyDeps = outer.ivySparkDeps
  }
}
To create the Spark cluster JAR, run: mill SparkSample.assembly
To create the standalone JAR, run: mill SparkSample.standalone.assembly
To create both, simply run: mill __.assembly
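Both JARs end up under Mill's out/ directory (the exact layout may vary by Mill version), typically out/SparkSample/assembly/dest/out.jar for the slim cluster JAR and out/SparkSample/standalone/assembly/dest/out.jar for the standalone one. The slim JAR is the one you would hand to spark-submit on the cluster, while the standalone JAR can be run directly for local tests.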