
Apache Kafka and Spark Streaming


I'm reading through this blog post:

http://blog.jaceklaskowski.pl/2015/07/20/real-time-data-processing-using-apache-kafka-and-spark-streaming.html

It discusses using Spark Streaming and Apache Kafka for near-real-time processing. I understand the article: it shows how to use Spark Streaming to read messages from a topic. Is there a Spark Streaming API that I can use to write messages to a Kafka topic?

My use case is pretty simple. I have a set of data that I can read from a given source at a constant interval (say, every second). I do this using reactive streams. I would like to do some analytics on this data using Spark, and I want fault tolerance, so Kafka comes into play. What I would essentially do is the following (please correct me if I am wrong):

  1. Using reactive streams, get the data from the external source at constant intervals
  2. Pipe the result into a Kafka topic
  3. Using Spark Streaming, create the streaming context for the consumer
  4. Perform analytics on the consumed data

One more question: is the Streaming API in Spark an implementation of the Reactive Streams specification? Does it handle back pressure (Spark Streaming v1.5)?


Solution

    1. No, at the moment, none of Spark Streaming's built-in receiver APIs is an implementation of the Reactive Streams specification. But there's an issue open for that, which you will want to follow.
    2. Spark Streaming 1.5 does have internal back-pressure-based dynamic throttling. Work to extend that beyond throttling is in progress. This throttling is compatible with the Kafka direct stream API.
    3. You can write to Kafka from a Spark Streaming application; here's one example.
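
To illustrate point 3, here is a minimal sketch of one common pattern: create one producer per RDD partition inside `foreachRDD` / `foreachPartition`. It is not taken from the linked example. The helper names (`make_kafka_producer`, `send_partition`, `write_stream_to_kafka`), the broker address and the use of the `kafka-python` client are all assumptions made for illustration.

```python
def make_kafka_producer():
    # Assumption: the kafka-python package is installed on every executor,
    # and "localhost:9092" is a placeholder for your broker list.
    from kafka import KafkaProducer
    return KafkaProducer(bootstrap_servers="localhost:9092")


def send_partition(records, topic, producer_factory=make_kafka_producer):
    """Send every record of one RDD partition to `topic`; return the count sent."""
    # The producer is created here, on the executor, because producer
    # objects generally cannot be serialized and shipped from the driver.
    producer = producer_factory()
    sent = 0
    try:
        for record in records:
            producer.send(topic, value=str(record).encode("utf-8"))
            sent += 1
        producer.flush()  # block until buffered messages are handed to the broker
    finally:
        producer.close()
    return sent


def write_stream_to_kafka(dstream, topic):
    """Attach a Kafka sink to a DStream (hypothetical helper)."""
    def write_rdd(rdd):
        rdd.foreachPartition(lambda part: send_partition(part, topic))
    dstream.foreachRDD(write_rdd)
```

Opening a producer per partition per batch is simple but not free; a per-executor pooled or lazily created producer is a common refinement if the connection overhead becomes noticeable.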

    (Full disclosure: I'm one of the implementers of some of the back-pressure work)
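
To give a feel for what "back-pressure-based dynamic throttling" in point 2 means, here is a toy PID-style rate estimator. It is only a sketch of the general idea (lower the ingestion rate when batches take longer to process or start queueing up) and should not be read as Spark's actual implementation; the class name, field names and default gains are all assumptions. In Spark 1.5 itself the feature is switched on with the `spark.streaming.backpressure.enabled` setting.

```python
from dataclasses import dataclass


@dataclass
class PidRateEstimator:
    """Toy PID-style controller for an ingestion rate in records/second."""
    proportional: float = 1.0   # weight of the current error
    integral: float = 0.2       # weight of the backlog (scheduling delay)
    derivative: float = 0.0     # weight of the error trend
    min_rate: float = 100.0     # never throttle below this rate
    latest_rate: float = -1.0
    latest_error: float = 0.0
    latest_time_ms: int = -1

    def compute(self, time_ms, num_records, processing_delay_ms,
                scheduling_delay_ms, batch_interval_ms):
        """Return a new rate limit, or None if no update can be made."""
        if (num_records <= 0 or processing_delay_ms <= 0
                or time_ms <= self.latest_time_ms):
            return None

        # Rate at which the last batch was actually processed.
        processing_rate = num_records / processing_delay_ms * 1000.0

        if self.latest_time_ms < 0:
            # First batch: use the measured rate as the baseline.
            self.latest_rate = processing_rate
            self.latest_time_ms = time_ms
            return None

        elapsed_s = (time_ms - self.latest_time_ms) / 1000.0
        # Positive error: we ingested faster than we could process.
        error = self.latest_rate - processing_rate
        # Records that piled up while the batch waited to be scheduled.
        backlog_error = scheduling_delay_ms * processing_rate / batch_interval_ms
        error_trend = (error - self.latest_error) / elapsed_s

        new_rate = max(
            self.latest_rate
            - self.proportional * error
            - self.integral * backlog_error
            - self.derivative * error_trend,
            self.min_rate,
        )
        self.latest_rate = new_rate
        self.latest_error = error
        self.latest_time_ms = time_ms
        return new_rate
```

For example, if a first 1-second batch of 1000 records is processed in 1 second, the baseline is 1000 records/s. If the next batch of 1000 records takes 2 seconds to process after waiting 1 second to be scheduled, the estimator with the default gains above lowers the limit to 400 records/s.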