scala · apache-spark · apache-kafka · spark-kafka-integration

Difference between spark-streaming-kafka-0-10 vs spark-sql-kafka-0-10


I am hoping to read a Parquet file and write its contents to Kafka:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.functions.to_json

object IngestFromS3ToKafka {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local[*]")
      .appName("ingest-from-s3-to-kafka")
      .config("spark.ui.port", "4040")
      .getOrCreate()

    val filePath = "s3a://my-bucket/my.parquet"
    spark.read.parquet(filePath)
      .select(to_json(struct("*")).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "hm-kafka-kafka-bootstrap.hm-kafka.svc:9092")
      .option("topic", "my-topic")
      .save()

    spark.stop()
  }
}

Based on the Structured Streaming + Kafka Integration Guide, it seems I should use the library spark-sql-kafka-0-10, which can do both batch processing and streaming.
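To illustrate the batch capability mentioned above, here is a minimal sketch of reading a Kafka topic as a one-off batch query with spark-sql-kafka-0-10 (using `spark.read` rather than `spark.readStream`). The broker address and topic name are taken from the question's write example and are assumptions for illustration:

```scala
// Batch read from Kafka with spark-sql-kafka-0-10:
// spark.read runs once over the topic's current contents, no streaming query involved.
val kafkaDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "hm-kafka-kafka-bootstrap.hm-kafka.svc:9092") // same broker as above
  .option("subscribe", "my-topic")
  .option("startingOffsets", "earliest") // read the whole topic
  .option("endingOffsets", "latest")
  .load()

// Kafka keys/values arrive as binary; cast them to strings for inspection.
kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
```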

Then I found these two libraries:

  • spark-streaming-kafka-0-10

  • spark-sql-kafka-0-10

In my case, it is batch processing rather than streaming. However, based on their names and descriptions, both seem related to streaming. What is the difference between these two libraries?

Is there any documentation regarding their differences? Thanks!


Solution

  • Oh I found the description in the Spark Streaming Programming Guide:

    Spark Streaming is the previous generation of Spark’s streaming engine. There are no longer updates to Spark Streaming and it’s a legacy project. There is a newer and easier to use streaming engine in Spark called Structured Streaming. You should use Spark Structured Streaming for your streaming applications and pipelines. See Structured Streaming Programming Guide.

    And inside, it mentions spark-streaming-kafka-0-10.

    So spark-streaming-kafka-0-10 is the legacy library (it targets the old DStream-based Spark Streaming engine), while spark-sql-kafka-0-10 is the current one (it targets Structured Streaming and also supports batch queries like the one in the question).
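    The contrast between the two APIs can be sketched as follows. This is a minimal, illustrative comparison, assuming a `SparkSession` named `spark` and placeholder broker/topic/group names:

    ```scala
    // Legacy DStream API (spark-streaming-kafka-0-10): RDD-based micro-batches.
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",        // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group"                   // placeholder group id
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))
    stream.map(_.value).print()

    // Structured Streaming (spark-sql-kafka-0-10): the same source as a DataFrame;
    // swap readStream for read and it becomes a batch query.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .load()
    ```

    Since the DStream API only offers `readStream`-style continuous processing, only spark-sql-kafka-0-10 fits a batch Parquet-to-Kafka job like the one in the question.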