I have a Kafka broker with a topic connected to Spark Structured Streaming. My topic sends data to my streaming dataframe, and I'd like to inspect each row from this topic, because I need to compare each row with another database.
If I could transform my batches into an RDD I could get each row easily.
I also saw something about DStreams, but I don't know if they still work with the latest version of Spark.
Is DStream the answer to my problem, or is there another solution to get my data row by row?
Read the data from Kafka with Structured Streaming and write your custom row comparison in a ForeachWriter, e.g.:
streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
      true // return true so this partition gets processed
    }

    def process(record: String): Unit = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()
As of Spark 2.4 this is supported in Python, Scala, and Java (Scala and Java have had the foreach sink since earlier releases; Python gained it in 2.4).
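To make this concrete, here is a minimal end-to-end sketch: it reads the topic into a Dataset[String] and looks each record up in a second database over JDBC inside the ForeachWriter. The broker address, topic name, JDBC URL, credentials, and the reference_table query are all placeholder assumptions you would adapt to your setup.

import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}

import org.apache.spark.sql.{ForeachWriter, SparkSession}

object RowComparisonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-row-comparison")
      .getOrCreate()
    import spark.implicits._

    // Kafka delivers key/value as binary, so cast the value to a string.
    val streamingDatasetOfString = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumption: broker address
      .option("subscribe", "my-topic")                     // assumption: topic name
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    val query = streamingDatasetOfString.writeStream
      .foreach(new ForeachWriter[String] {
        var connection: Connection = _
        var statement: PreparedStatement = _

        // Called once per partition per epoch, on the executor,
        // so the JDBC connection is created where it is used.
        def open(partitionId: Long, version: Long): Boolean = {
          connection = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/refdb", // assumption: JDBC URL
            "user", "password")                       // assumption: credentials
          statement = connection.prepareStatement(
            "SELECT count(*) FROM reference_table WHERE key = ?") // assumption: lookup query
          true // returning false would skip this partition
        }

        // Called once per row: compare the Kafka record against the other database.
        def process(record: String): Unit = {
          statement.setString(1, record)
          val rs: ResultSet = statement.executeQuery()
          val exists = rs.next() && rs.getInt(1) > 0
          rs.close()
          if (!exists) {
            println(s"Record not found in reference DB: $record")
          }
        }

        // Called when the partition is done; errorOrNull is non-null on failure.
        def close(errorOrNull: Throwable): Unit = {
          if (statement != null) statement.close()
          if (connection != null) connection.close()
        }
      })
      .start()

    query.awaitTermination()
  }
}

A per-row SELECT is the simplest way to express the comparison, but if the reference table is small you may get better throughput by broadcasting it and joining instead of opening a JDBC statement per partition.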