
ClassNotFoundException: breeze.storage.Zero$DoubleZero$


I'm trying to run distributed KMeans with Spark MLlib and I'm getting the following error:

Caused by: java.lang.ClassNotFoundException: breeze.storage.Zero$DoubleZero$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

I'm using Scala 2.13.0, Spark 3.3.0, and Breeze 2.1.0. Does anyone know how to solve it?


Solution

  • Looks like an issue with dependencies.

    In Breeze 1.3 and earlier, breeze.storage.Zero.DoubleZero was defined as

    @SerialVersionUID(1L)
    implicit object DoubleZero extends Zero[Double] {
      override def zero = 0.0
    }
    

    https://github.com/scalanlp/breeze/blob/releases/v1.3/math/src/main/scala/breeze/storage/Zero.scala#L77

    and breeze.storage.Zero.DoubleZero.getClass produced breeze.storage.Zero$DoubleZero$.

    But in Breeze 2.0 and later, DoubleZero is defined as

    implicit val DoubleZero: Zero[Double] = Zero(0.0)
    

    https://github.com/scalanlp/breeze/blob/releases/v2.0/math/src/main/scala/breeze/storage/Zero.scala#L46

    @SerialVersionUID(1L)
    case class Zero[@specialized T](zero: T) extends Serializable
    

    and breeze.storage.Zero.DoubleZero.getClass produces breeze.storage.Zero$mcD$sp (because of @specialized) while Class.forName("breeze.storage.Zero$DoubleZero$") throws ClassNotFoundException.

    You should check which of your dependencies still uses Breeze 1.3 or earlier.
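
    One way to see which Breeze actually ends up on the runtime classpath is to ask the JVM where a class was loaded from. The following is only a sketch: ClasspathProbe and locate are hypothetical names, not part of Spark or Breeze. Note that NoClassDefFoundError is a LinkageError, which scala.util.Try does not catch, so the sketch uses an explicit try/catch.

    ```scala
    // Hypothetical helper (not part of Spark or Breeze): reports whether a class
    // can be loaded and, if so, which jar or directory it was loaded from.
    object ClasspathProbe {
      def locate(className: String): Either[String, String] =
        try {
          val cls = Class.forName(className)
          // getCodeSource is null for classes loaded by the bootstrap loader (e.g. JDK classes).
          val source = Option(cls.getProtectionDomain.getCodeSource)
            .flatMap(cs => Option(cs.getLocation))
            .map(_.toString)
            .getOrElse("<no code source, e.g. a JDK class>")
          Right(source)
        } catch {
          // The class itself is missing from the classpath.
          case e: ClassNotFoundException => Left(s"${e.getClass.getName}: ${e.getMessage}")
          // The class was found, but something it links against is missing or incompatible
          // (NoClassDefFoundError is a LinkageError).
          case e: LinkageError => Left(s"${e.getClass.getName}: ${e.getMessage}")
        }

      def main(args: Array[String]): Unit =
        Seq(
          "breeze.storage.Zero",                      // which Breeze jar is on the classpath?
          "breeze.storage.Zero$DoubleZero$",          // exists only in Breeze 1.3 and earlier
          "org.apache.spark.ml.linalg.SparseMatrix"   // needs the Breeze it was compiled against
        ).foreach(name => println(s"$name -> ${locate(name)}"))
    }
    ```

    Run inside the failing project, the Right value for breeze.storage.Zero should point at the Breeze jar that won dependency resolution, and a Left for the other entries indicates the mismatch described above.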


    Update: thanks for the MCVE.

    Debugging shows that NoClassDefFoundError/ClassNotFoundException is thrown here

      private lazy val loadableSparkClasses: Seq[Class[_]] = {
        Seq(
          // ...
          "org.apache.spark.ml.linalg.SparseMatrix",   // <---
          // ...
        ).flatMap { name =>
          try {
            Some[Class[_]](Utils.classForName(name))   // <---
          } catch {
            case NonFatal(_) => None // do nothing
            case _: NoClassDefFoundError if Utils.isTesting => None // See SPARK-23422.
          }
        }
      }
    

    https://github.com/apache/spark/blob/v3.3.0/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L521

    Simpler reproduction is

    Class.forName("org.apache.spark.ml.linalg.SparseMatrix")
    // java.lang.NoClassDefFoundError: breeze/storage/Zero$DoubleZero$ ...
    // Caused by: java.lang.ClassNotFoundException: breeze.storage.Zero$DoubleZero$ ...
    

    As I said, one of your dependencies uses Breeze 1.3 or earlier, even though you believe you're using Breeze 2.1.0. Namely, org.apache.spark.ml.linalg.SparseMatrix comes from spark-mllib-local, and spark-mllib-local 3.3.0 depends on Breeze 1.2:

    <dependency>
        <groupId>org.scalanlp</groupId>
        <artifactId>breeze_2.13</artifactId>
        <version>1.2</version>
        <scope>compile</scope>
        <exclusions>
            <exclusion>
                <artifactId>commons-math3</artifactId>
                <groupId>org.apache.commons</groupId>
            </exclusion>
        </exclusions>
    </dependency>
    

    https://repo1.maven.org/maven2/org/apache/spark/spark-mllib-local_2.13/3.3.0/spark-mllib-local_2.13-3.3.0.pom

    So Spark 3.3.0 (and 3.3.2) is incompatible with Breeze 2.0+. Use Breeze 1.3 or earlier:

    scalaVersion := "2.13.0"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "3.3.0",
      "org.apache.spark" %% "spark-mllib" % "3.3.0",
      "org.scalanlp"     %% "breeze"      % "1.3"
    )
    

    Then your code runs successfully.
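
    If some other library in the build pulls in a newer Breeze transitively, one option is to pin the version with sbt's dependencyOverrides. This is a sketch only; pinning to 1.2 is an assumption, chosen because that is the version listed in the spark-mllib-local 3.3.0 POM above.

    ```scala
    // build.sbt (sketch)
    scalaVersion := "2.13.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "3.3.0",
      "org.apache.spark" %% "spark-mllib" % "3.3.0"
      // No explicit Breeze dependency: spark-mllib brings Breeze in transitively.
    )

    // Force a single Breeze version even if another dependency requests a newer one.
    // Assumption: 1.2, the version spark-mllib-local 3.3.0 declares in its POM.
    dependencyOverrides += "org.scalanlp" %% "breeze" % "1.2"
    ```

    Running the built-in sbt evicted task afterwards shows which versions were requested and which one was selected.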

    Compatibility issues between different versions of Spark and Breeze are not rare:

    https://github.com/scalanlp/breeze/issues/710

    Apache Spark - java.lang.NoSuchMethodError: breeze.linalg.Vector$.scalarOf()Lbreeze/linalg/support/ScalarOf

    https://github.com/scalanlp/breeze/issues/690

    Spark 3.4.0 is expected to upgrade Breeze to 2.0:

    https://issues.apache.org/jira/browse/SPARK-39616

    Meanwhile you can try it with the following build.sbt

    scalaVersion := "2.13.0"
    
    resolvers += "apache-repo" at "https://repository.apache.org/content/groups/snapshots"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "3.4.0-SNAPSHOT",
      "org.apache.spark" %% "spark-mllib" % "3.4.0-SNAPSHOT",
      "org.scalanlp"     %% "breeze"      % "2.1.0"
    )
    

    Then your code runs successfully too.