I'm working on a recommender system using Apache Flink. The implementation runs when I test it in IntelliJ, but I would now like to run it on a cluster. I built a jar file and tested it locally to check that everything works, but I ran into a problem:
java.lang.NoClassDefFoundError: org/apache/flink/ml/common/FlinkMLTools$
As we can see, the class FlinkMLTools used in my code isn't found when the jar runs. I built the jar with Maven 3.3.3 using mvn clean install, and I'm using Flink 0.9.0.
First Trail
My global project contains other projects, and this recommender is one of the sub-projects. As a result, I have to run mvn clean install in the folder of the right project; otherwise Maven always builds a jar of another project (and I don't understand why). So I'm wondering whether there is a way to tell Maven explicitly to build one specific project of the global project. Perhaps the path to FlinkMLTools is contained in a link present in the pom.xml file of the global project.
Any other ideas?
The problem is that Flink's binary distribution does not contain the libraries (flink-ml, gelly, etc.). This means that you either have to ship the library jar files with your job jar or that you have to copy them manually to your cluster. I strongly recommend the first option.
The easiest way to build a fat jar which does not contain unnecessary jars is to use Flink's quickstart archetype to set up the project's pom.
mvn archetype:generate -DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-scala -DarchetypeVersion=0.9.0
will create the structure for a Flink project using the Scala API. The generated pom file will have the following dependencies.
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>0.9.0</version>
    </dependency>
</dependencies>
You can remove flink-streaming-scala and instead insert the following dependency tag to include Flink's machine learning library.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-ml</artifactId>
    <version>0.9.0</version>
</dependency>
When you now build the job jar with mvn package, the generated jar should contain the flink-ml jar and all of its transitive dependencies.
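One way to check that the fat jar really contains the class from the error message is to search the jar's file listing for it. The following is only a minimal sketch: has_class is a hypothetical helper name, and the jar path in the usage comment is a placeholder for whatever your build produces.

```shell
#!/bin/sh
# has_class (hypothetical helper): reads a jar listing on stdin, as printed by
# `jar tf`, and succeeds if the listing contains the given class.
# $1 is the class name exactly as shown in the NoClassDefFoundError,
# e.g. org/apache/flink/ml/common/FlinkMLTools$
has_class() {
    # -x: match the whole line, -F: treat the pattern as a fixed string
    # (so '$' and '.' are not interpreted as regex characters), -q: no output
    grep -qxF "$1.class"
}

# Usage (the jar path below is a placeholder, not a real artifact name):
#   jar tf target/my-job.jar | has_class 'org/apache/flink/ml/common/FlinkMLTools$' \
#       && echo "class is bundled" || echo "class is missing"
```

If the class is reported as missing, the flink-ml dependency is most likely not being bundled into the job jar.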
If you instead copy the libraries to the cluster manually, note that Flink includes all jars located in the <FLINK_ROOT_DIR>/lib folder in the classpath of the executed jobs. Thus, in order to use Flink's machine learning library, you have to put the flink-ml jar and all needed transitive dependencies into the /lib folder. This is rather tricky, since you have to figure out which transitive dependencies your algorithm actually needs and, consequently, you will often end up copying all transitive dependencies.
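For the manual route, a rough sketch of the copy step could look like the following. It assumes you have already gathered the flink-ml jar and its transitive dependencies into one directory, for example with the maven-dependency-plugin as shown in the comment. copy_jars_to_lib is a hypothetical helper name, and both paths are placeholders.

```shell
#!/bin/sh
# One possible way to collect the dependencies first (run in the project folder):
#   mvn dependency:copy-dependencies -DincludeScope=runtime -DoutputDirectory=target/dependency

# copy_jars_to_lib (hypothetical helper): copies every *.jar file from a source
# directory into the lib folder of a Flink installation.
# $1: directory holding the collected jars
# $2: Flink root directory, i.e. <FLINK_ROOT_DIR>
copy_jars_to_lib() {
    src_dir=$1
    flink_root=$2
    # Refuse to continue if the target does not look like a Flink installation
    [ -d "$flink_root/lib" ] || { echo "no lib folder under $flink_root" >&2; return 1; }
    for jar in "$src_dir"/*.jar; do
        [ -e "$jar" ] || continue   # the glob matched nothing
        cp "$jar" "$flink_root/lib/" || return 1
    done
}

# Usage (placeholder paths):
#   copy_jars_to_lib target/dependency /path/to/flink-0.9.0
```

Copying everything this way is the "end up copying all transitive dependencies" case described above, which is part of why the fat jar is the recommended option.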
In order to build a specific sub-module X from your parent project you can use the following command:
mvn clean package -pl X -am
-pl allows you to specify which sub-modules you want to build, and -am tells Maven to also build other required sub-modules. It is also described here.