apache-kafka synchronization iot distributed-system stream-processing

Synchronize Data From Multiple Data Sources


Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.

We are at the design phase and the current system design is as follows:

Problem:

In order to classify a set of events as an anomaly, the events have to occur in the same time window. For example, say three data sources push their respective events into Kafka topics, but for some reason the data is not synchronized. One of the inference engines then pulls the latest entries from each of the Kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say, 1 hour). That results in invalid predictions due to out-of-sync data.

Question

How can we make sure that the data from all three sources is pushed in order, so that when an inference engine requests entries (say, the last 100 entries) from multiple Kafka topics, the corresponding entries in each topic belong to the same time window?


Solution

  • I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also supports windowed aggregations and windowed joins.

    There are three ways to define windows in KSQL: hopping windows, tumbling windows, and session windows. Hopping and tumbling windows are time windows, because they're defined by fixed durations that you specify. Session windows are dynamically sized based on incoming data and defined by periods of activity separated by gaps of inactivity.

    In your context, you can use KSQL to query and combine the topics of interest using windowed joins, once each topic has been registered as a KSQL stream. For example:

    SELECT t1.id, ...
    FROM topic_1 t1
    INNER JOIN topic_2 t2
    WITHIN 1 HOURS
    ON t1.id = t2.id;
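
To make the windowing idea concrete outside of KSQL, here is a minimal Python sketch of tumbling-window alignment. It is only an in-memory illustration, not how KSQL implements it. The source names (`sensor_a`, `sensor_b`, `sensor_c`), the `(source, timestamp, payload)` event shape, and the helpers `window_start` and `complete_windows` are all hypothetical; the 1-hour window is taken from the example in the question. The sketch buckets events by the hour they fall into and keeps only the windows in which every expected source has reported, which is the condition the inference engine needs before it predicts.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=1)  # assumed window size, from the 1-hour example
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, size=WINDOW):
    """Return the start of the tumbling window that contains ts."""
    return EPOCH + ((ts - EPOCH) // size) * size


def complete_windows(events, sources):
    """Group (source, timestamp, payload) events by tumbling window and
    keep only the windows in which every expected source reported."""
    buckets = defaultdict(lambda: defaultdict(list))
    for source, ts, payload in events:
        buckets[window_start(ts)][source].append(payload)
    return {
        start: dict(by_source)
        for start, by_source in sorted(buckets.items())
        if set(sources) <= set(by_source)
    }


def at(hour, minute):
    """Hypothetical helper: a UTC timestamp on an arbitrary example day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    ("sensor_a", at(10, 5), "a1"),
    ("sensor_b", at(10, 40), "b1"),
    ("sensor_c", at(10, 59), "c1"),
    ("sensor_a", at(11, 10), "a2"),
    ("sensor_b", at(11, 20), "b2"),  # sensor_c has not reported for 11:00 yet
]

result = complete_windows(events, ["sensor_a", "sensor_b", "sensor_c"])
for start, by_source in result.items():
    print(start.isoformat(), by_source)
```

Only the 10:00 window is returned here, because the 11:00 window is still missing an event from `sensor_c`. Note that this buckets events into fixed, aligned hours, whereas the `WITHIN 1 HOURS` join above matches records whose timestamps lie within one hour of each other, so the two are related but not identical.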