java, maven, apache-spark, apache-kafka, spark-kafka-integration

Spark Streaming - Kafka - "java -jar" not working, but running as a Java application works


I have a simple Java Spark program; basically, it just reads Kafka data:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class App
{
    public static void main( String[] args )
    {
        System.out.println( "Hello World!" );

        // Create a local Spark session
        SparkSession spark = SparkSession.builder()
                .appName("Kafka_Load")
                .config("spark.driver.allowMultipleContexts", "true")
                .config("spark.master", "local")
                .getOrCreate();

        // Read a streaming Dataset from the Kafka topic
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "my_topic")
                .load();

        System.out.println( "Hello World1!" );
    }
}

This runs fine when I run it from Eclipse as a Java application, but when I run "java -jar my_jar.jar" it throws the following exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;

My pom.xml is the following:

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.4.0</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.2.0</version>
    </dependency>
  </dependencies>

How can I solve this?


Solution

First, you need an uber JAR that actually includes the spark-sql-kafka-0-10 classes (refer to the Maven Shade plugin documentation, or the Assembly plugin's "jar-with-dependencies" descriptor). You should also remove spark-streaming-kafka-0-10 from the POM, since you aren't using it: Structured Streaming's readStream().format("kafka") is provided by spark-sql-kafka-0-10.
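For example, a minimal Shade configuration sketch could look like the following (the plugin version is illustrative). The ServicesResourceTransformer is the important part: each Spark jar ships its own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file, and if those files are not merged, the uber JAR ends up without the Kafka entry and you get exactly this "Failed to find data source: kafka" error:

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.2.1</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <transformers>
                  <!-- Merge META-INF/services files so the Kafka
                       DataSourceRegister entry survives in the uber JAR -->
                  <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                  <!-- Optional: set Main-Class so "java -jar" can find it;
                       "App" is the class name from the question -->
                  <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                    <mainClass>App</mainClass>
                  </transformer>
                </transformers>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>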

Second, you should use spark-submit to run Spark applications, so that the correct program classpath is established. spark-submit can even fetch the Kafka source for you with --packages instead of an uber JAR, as shown below.
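A submit command along these lines matches the deployment section of the Structured Streaming + Kafka Integration Guide (the class and JAR names are taken from the question; local[*] is an assumption for local testing):

    spark-submit \
      --master local[*] \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
      --class App \
      my_jar.jar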

    runs well when I run by eclipse

Eclipse, when you run the program from the IDE, puts the dependencies listed in your POM on the classpath by default. Plain "java -jar" only sees what is packaged inside the JAR itself, which is why the Kafka source is found in one case and not the other.
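As a quick sanity check (standard unzip tooling; the path is the service file Spark uses for data source discovery), you can verify that your uber JAR actually registers the Kafka source:

    unzip -p my_jar.jar META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

If the output doesn't include an entry like org.apache.spark.sql.kafka010.KafkaSourceProvider, the kafka format will not resolve at runtime.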