scala, apache-spark, mill

Spark and mill - create an additional task that builds a filtered assembly


I want to set up a mill build that lets me develop and run a Spark job locally, either via SparkSample.run or by building a full fat JAR for local tests. At some point I'd like to ship it as a filtered assembly (i.e. without the Spark-related libs, but with all project libs) to a cluster with a running Spark context.
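
For context, the job itself is just an ordinary Spark application started from a main method; a minimal sketch of what it might look like (the object name and body here are illustrative, not from the actual project):

import org.apache.spark.sql.SparkSession

// Hypothetical entry point (e.g. SparkSample/src/SparkSampleApp.scala) used for local development runs.
object SparkSampleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      // "local[*]" is hard-coded only for the local-development scenario;
      // when submitting the filtered assembly to a cluster, drop this line
      // and let spark-submit set the master.
      .master("local[*]")
      .appName("SparkSample")
      .getOrCreate()

    import spark.implicits._
    // trivial placeholder logic, just to have something runnable
    Seq(1, 2, 3).toDF("n").show()

    spark.stop()
  }
}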

I currently use this build.sc:

import mill._, scalalib._
import mill.modules.Assembly

object SparkSample extends ScalaModule {
  def scalaVersion = "2.12.10"
  def scalacOptions =
    Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")

  def ivySparkDeps = Agg(
    ivy"org.apache.spark::spark-sql:2.4.5"
      .exclude("org.slf4j" -> "slf4j-log4j12"),
    ivy"org.slf4j:slf4j-api:1.7.16",
    ivy"org.slf4j:slf4j-log4j12:1.7.16"
  )

  def ivyBaseDeps = Agg(
    ivy"com.lihaoyi::upickle:0.9.7"
  )

  // STANDALONE APP
  def ivyDeps = ivyBaseDeps ++ ivySparkDeps

  // REMOTE SPARK CLUSTER
  // def ivyDeps = ivyBaseDeps
  // def compileIvyDeps = ivySparkDeps
  // def assemblyRules =
  //   Assembly.defaultRules ++
  //     Seq(
  //       "scala/.*",
  //       "org.slf4j.*",
  //       "org.apache.log4j.*"
  //     ).map(Assembly.Rule.ExcludePattern.apply)
}

For running locally and building a full fat JAR, I keep the file as is.

To create a filtered assembly, I comment out the ivyDeps line under "STANDALONE APP" and uncomment everything below "REMOTE SPARK CLUSTER".

Editing the build file every time I need a different artifact felt inelegant, so I tried to add a separate task to build.sc:

  def assembly2 = T {
    def ivyDeps = ivyBaseDeps
    def compileIvyDeps = ivySparkDeps
    def assemblyRules =
      Assembly.defaultRules ++
        Seq(
          "scala/.*",
          "org.slf4j.*",
          "org.apache.log4j.*"
        ).map(Assembly.Rule.ExcludePattern.apply)
    super.assembly
  }

but when I run SparkSample.assembly2, I still get a full assembly instead of a filtered one. It seems that overriding ivyDeps et al. inside a task does not work.

Is this possible in mill?


Solution

  • You can't override defs inside a task. Locally defining ivyDeps and compileIvyDeps will not magically make super.assembly use them.

    Of course, you could create such a task by looking at how super.assembly is defined in JavaModule, but you would end up copying and adapting a lot more targets (upstreamAssembly, upstreamAssemblyClasspath, transitiveLocalClasspath, and so on) and make your build file hard to read.

    A better way is to make the lighter dependencies and assembly rules the default and move the creation of the standalone JAR into a sub-module.

    import mill._, scalalib._
    import mill.modules.Assembly
    
    object SparkSample extends ScalaModule { outer =>
      def scalaVersion = "2.12.10"
      def scalacOptions =
        Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")
    
      def ivySparkDeps = Agg(
        ivy"org.apache.spark::spark-sql:2.4.5"
          .exclude("org.slf4j" -> "slf4j-log4j12"),
        ivy"org.slf4j:slf4j-api:1.7.16",
        ivy"org.slf4j:slf4j-log4j12:1.7.16"
      )
    
      def ivyDeps = Agg(
        ivy"com.lihaoyi::upickle:0.9.7"
      )
    
      def compileIvyDeps = ivySparkDeps
    
      def assemblyRules =
        Assembly.defaultRules ++
          Seq(
            "scala/.*",
            "org.slf4j.*",
            "org.apache.log4j.*"
          ).map(Assembly.Rule.ExcludePattern.apply)
    
      object standalone extends ScalaModule {
        def scalaVersion = outer.scalaVersion
        def moduleDeps = Seq(outer)
        def ivyDeps = outer.ivySparkDeps
      }
    }
    

    To create a Spark cluster JAR, run: mill SparkSample.assembly

    To create a standalone JAR, run: mill SparkSample.standalone.assembly

    To create both, simply run: mill __.assembly
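
    For completeness, a rough sketch of how the two JARs might then be used (this assumes mill's default out/ layout and the hypothetical main class SparkSampleApp mentioned above):

    Run the standalone JAR locally: java -cp out/SparkSample/standalone/assembly/dest/out.jar SparkSampleApp

    Send the filtered JAR to the cluster and let spark-submit supply the Spark libraries: spark-submit --class SparkSampleApp --master yarn out/SparkSample/assembly/dest/out.jar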