dataframe, pyspark, spark-structured-streaming, spark-streaming-kafka, discretization

How do I get the data of one row of a Structured Streaming Dataframe in pyspark?


I have a Kafka broker with a topic connected to Spark Structured Streaming. My topic sends data to my streaming DataFrame, and I'd like to access each row of this topic, because I need to compare each row against another database.

If I could transform my micro-batches into an RDD, I could get each row easily.
I also saw something about DStreams, but I don't know whether they still work with the latest version of Spark.

Is DStream the answer to my problem, or is there another way to get my data row by row?


Solution

  • Read the data from Kafka with Spark Structured Streaming and implement your custom row comparison in a `ForeachWriter` passed to `writeStream.foreach`. For example:

    streamingDatasetOfString.writeStream.foreach(
      new ForeachWriter[String] {

        def open(partitionId: Long, epochId: Long): Boolean = {
          // Open the connection; return true to process this partition
          true
        }

        def process(record: String): Unit = {
          // Compare the record / write it to the connection
        }

        def close(errorOrNull: Throwable): Unit = {
          // Close the connection
        }
      }
    ).start()
    

    This is supported in Python, Scala, and Java since Spark 2.4.
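Since the question uses PySpark, here is a minimal Python sketch of the same approach. The broker address `kafka:9092`, the topic `my_topic`, and the `compare_with_database` helper are assumptions/placeholders; swap in your real connection details and comparison logic.

```python
def compare_with_database(value):
    # Hypothetical comparison stub (assumption): replace with a real
    # lookup against your other database.
    return value is not None


class RowComparer:
    """Per-row handler, mirroring the Scala ForeachWriter above."""

    def open(self, partition_id, epoch_id):
        # Open any connection the comparison needs; return True to
        # process the rows of this partition.
        return True

    def process(self, row):
        # `row` is a pyspark.sql.Row; `row.value` is the Kafka payload,
        # cast to a string in the query below.
        compare_with_database(row.value)

    def close(self, error):
        # Close the connection (called even if processing failed).
        pass


def start_stream():
    # Wiring sketch; broker and topic names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-row-compare").getOrCreate()
    df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder
        .option("subscribe", "my_topic")                  # placeholder
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )
    return df.writeStream.foreach(RowComparer()).start()
```

Alternatively, `writeStream.foreachBatch(fn)` (also added in Spark 2.4) hands your function a regular, non-streaming DataFrame per micro-batch, which you can then convert to an RDD as the question suggests.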