maven jar noclassdeffounderror apache-flink flinkml

FlinkMLTools NoClassDefFoundError when running a jar built with Maven


I'm working on a recommender system using Apache Flink. The implementation runs when I test it in IntelliJ, but I would now like to run it on a cluster. I built a jar file and tested it locally to check that everything works, but I ran into a problem:

java.lang.NoClassDefFoundError: org/apache/flink/ml/common/FlinkMLTools$

As we can see, the class FlinkMLTools used in my code is not found when the jar runs. I built the jar with Maven 3.3.3 using mvn clean install, and I'm using Flink 0.9.0.

First Trail

My global project contains several other projects, and this recommender is one of its sub-projects. Because of that, I have to run mvn clean install in the folder of the right sub-project; otherwise Maven always builds the jar of another project (and I don't understand why). So I'm wondering whether there is a way to tell Maven explicitly to build one specific sub-project of the global project. Perhaps the path to FlinkMLTools comes from a reference in the pom.xml file of the global project.

Any other ideas?


Solution

  • The problem is that Flink's binary distribution does not contain the libraries (flink-ml, gelly, etc.). This means that you either have to ship the library jar files with your job jar or that you have to copy them manually to your cluster. I strongly recommend the first option.

    Building a fat-jar to include library jars

    The easiest way to build a fat jar which does not contain unnecessary jars is to use Flink's quickstart archetype to set up the project's pom.

    mvn archetype:generate -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-scala -DarchetypeVersion=0.9.0 
    

    will create the structure for a Flink project using the Scala API. The generated pom file will have the following dependencies.

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala</artifactId>
            <version>0.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala</artifactId>
            <version>0.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>0.9.0</version>
        </dependency>
    </dependencies>
    

    You can remove flink-streaming-scala and insert the following dependency tag instead to include Flink's machine learning library.

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-ml</artifactId>
        <version>0.9.0</version>
    </dependency>
    

    When you now build the job jar with mvn package, the generated jar should contain the flink-ml jar and all of its transitive dependencies.
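One way to confirm that the class from the stack trace actually ended up in the job jar is to list the jar's entries and search for it. The following is a minimal sketch, assuming a JDK's jar tool is on the PATH. The helper names class_to_entry and jar_has_class and the jar path in the usage comment are hypothetical; they are not part of Flink or Maven.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a job jar; not part of Flink or Maven.

# Convert a fully qualified class name to its entry path inside a jar,
# e.g. a.b.C -> a/b/C.class
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed (exit status 0) if the jar in $1 has an entry matching the class
# named in $2. `jar tf` lists the jar's table of contents.
jar_has_class() {
  jar tf "$1" | grep -qF -- "$(class_to_entry "$2")"
}

# Usage (the jar path is an assumption; use whatever mvn package produced):
#   jar_has_class target/my-job.jar 'org.apache.flink.ml.common.FlinkMLTools$' \
#     && echo "flink-ml is bundled" || echo "flink-ml is missing"
```

The class name is single-quoted in the usage line so that the shell does not treat the trailing $ of the Scala object class specially.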

    Copying the library jars manually to the cluster

    Flink includes all jars which are located in the <FLINK_ROOT_DIR>/lib folder in the classpath of the executed jobs. Thus, in order to use Flink's machine learning library, you have to put the flink-ml jar and all needed transitive dependencies into the <FLINK_ROOT_DIR>/lib folder. This is rather tricky, since you have to figure out which transitive dependencies are actually needed by your algorithm and, consequently, you will often end up copying all transitive dependencies.
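If you take the manual route anyway, Maven's dependency plugin can collect the candidate jars into a staging directory first. The sketch below assumes it is run from the job's module. The helper name copy_jars_to_flink_lib and the target/libs staging directory are hypothetical. Note that copy-dependencies also collects any Flink core jars declared in the pom, which the binary distribution already ships, so the staging directory likely needs pruning before copying.

```shell
#!/bin/sh
# Hypothetical helper: copy every *.jar from a staging directory ($1)
# into the lib folder of a Flink installation ($2).
copy_jars_to_flink_lib() {
  src_dir=$1
  flink_root=$2
  # Refuse to run if $2 does not look like a Flink installation.
  [ -d "$flink_root/lib" ] || { echo "no lib folder in $flink_root" >&2; return 1; }
  cp "$src_dir"/*.jar "$flink_root/lib/"
}

# Usage (paths are assumptions):
#   mvn dependency:copy-dependencies -DincludeScope=runtime \
#       -DoutputDirectory="$PWD/target/libs"
#   copy_jars_to_flink_lib "$PWD/target/libs" "$FLINK_ROOT_DIR"
```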

    How to build a specific sub-module with maven

    In order to build a specific sub-module X from your parent project you can use the following command:

     mvn clean package -pl X -am
    

    -pl (--projects) lets you specify which sub-modules you want to build, and -am (--also-make) tells Maven to also build the other sub-modules that they depend on. It is also described here.