pyspark apache-kafka jupyter spark-kafka-integration

How to add Kafka dependencies for PySpark on a Jupyter notebook


I have set up Kafka 2.1 on Windows and am able to successfully send messages on a topic from a producer to a consumer over localhost:9092.

I now want to consume this topic in a Spark Structured Streaming job.

For this I set up Spark 3.4 and installed PySpark on a Jupyter kernel, and it is working well.

The issue I have now is how to correctly configure the Kafka Spark dependency JARs in Jupyter. I have tried the following:

spark = SparkSession \
    .builder \
    .appName("KafkaStreamingExample") \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0') \
    .getOrCreate()

stream_df = spark.readStream\
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "stocky") \
    .load()

I get the error

Failed to find data source: kafka

I know there are options to load the packages with spark-submit, but I particularly need to know if it's possible to get it working within the Jupyter notebook environment.

It would be great if someone can point me in the right direction.


Solution

  • What you've done is correct, but the version matters. You're running Spark 3.4.0, so you won't be able to use spark-sql-kafka-0-10 version 3.3.0; the connector version has to match your Spark version.

    You also need the Kafka client JAR to use it with PySpark.

    https://github.com/OneCricketeer/docker-stacks/blob/master/hadoop-spark/spark-notebooks/kafka-sql.ipynb
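
    As a rough sketch (not the linked notebook's exact code), a configuration along these lines should work, assuming Spark 3.4.0 built against Scala 2.12 and the localhost:9092 broker and stocky topic from the question:

    from pyspark.sql import SparkSession

    # The connector version must match the installed Spark version (3.4.0 here)
    # and its Scala build suffix (_2.12). spark.jars.packages resolves the
    # package and its transitive dependencies (such as kafka-clients) from
    # Maven Central when the session is created.
    spark = SparkSession \
        .builder \
        .appName("KafkaStreamingExample") \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \
        .getOrCreate()

    stream_df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "stocky") \
        .load()

    Keep in mind that spark.jars.packages only takes effect when the session is first created: if a SparkSession already exists in the notebook, getOrCreate() returns it and the config is silently ignored, so restart the Jupyter kernel (or stop the existing session) before running this cell.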