pyspark apache-kafka jupyter spark-kafka-integration

How to add Kafka dependencies for PySpark on a Jupyter notebook


I have set up Kafka 2.1 on Windows and am able to successfully send messages on a topic from a producer to a consumer over localhost:9092.

I now want to consume this topic in a Spark Structured Streaming job.

For this I set up Spark 3.4 and installed PySpark on a Jupyter kernel, and it is working well.

The issue I have now is how to correctly configure the Kafka Spark dependency JARs in Jupyter. I have tried the following:

spark = SparkSession \
    .builder \
    .appName("KafkaStreamingExample") \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0') \
    .getOrCreate()

stream_df = spark.readStream\
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "stocky") \
    .load()

I get the error

Failed to find data source: kafka

I know there are options to load the packages with spark-submit, but I particularly need to know if it's possible to get it working within the Jupyter notebook environment.

It would be great if someone can point me in the right direction.


Solution

  • What you've done is correct, but the version matters. You're running Spark 3.4.0, so you won't be able to use spark-sql-kafka-0-10 version 3.3.0; the connector version has to match your Spark version.

    You also need the Kafka client JAR to use it with PySpark.

    https://github.com/OneCricketeer/docker-stacks/blob/master/hadoop-spark/spark-notebooks/kafka-sql.ipynb
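
    As a rough sketch (not the linked notebook's exact code), a configuration along these lines should work, assuming Spark 3.4.0 built against Scala 2.12 and the localhost:9092 broker and stocky topic from the question:

    from pyspark.sql import SparkSession

    # The connector version must match the installed Spark version (3.4.0 here)
    # and its Scala build suffix (_2.12). spark.jars.packages resolves the
    # package and its transitive dependencies (such as kafka-clients) from
    # Maven Central when the session is created.
    spark = SparkSession \
        .builder \
        .appName("KafkaStreamingExample") \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \
        .getOrCreate()

    stream_df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "stocky") \
        .load()

    Keep in mind that spark.jars.packages only takes effect when the session is first created: if a SparkSession already exists in the notebook, getOrCreate() returns it and the config is silently ignored, so restart the Jupyter kernel (or stop the existing session) before running this cell.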