
Kafka vs StreamSets


I have been reading articles about Kafka and StreamSets, and my understanding is:

  1. Kafka acts as a broker between producer systems and subscribers. Producers push data into the Kafka cluster, and subscribers pull data from Kafka.

  2. StreamSets is a technology for moving data from a source to a destination through a pipeline.

My questions are below; please help clarify:

  1. What is the fundamental difference between Kafka and StreamSets? Is it that Kafka doesn't move data but StreamSets does?

  2. If Kafka doesn't move data, what is Kafka used for? If it moves data like ETL solutions do, how is it different from SSIS, Informatica, etc.?

  3. How is StreamSets different from SSIS, Informatica etc?


Solution

    1. In StreamSets, most of the time we create "data pipelines". Think of a pipeline as an application that consists of multiple steps/tasks: the first step might read data from a database, Kafka, or any number of other data sources; the second might modify the data; the third might run a script; and so on. Finally, it saves the transformed data to a destination, which could be a database or cloud storage. So Kafka and StreamSets can work together: StreamSets can read data from and write data to Kafka.

    2. I think of Kafka as a place where data from multiple sources is collected and made available to consumers for a certain time. For example, a producer (or a Kafka Connect source connector) can read from a database table periodically and store the changes in a "topic", and another can poll a web service periodically and store that data in a second topic. These topics are then available to consumers: a developer can write an application that reads from the first topic and does something with the data. Kafka keeps track of what each consumer has read by using offsets, and it offers replication and other options. It removes the need to write custom code to integrate multiple sources and destinations; instead, you configure that part.

    StreamSets can read from and write to Kafka. StreamSets does not store the data in its own system while Kafka stores the data for a configurable period of time.
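
To make the distinction concrete, here is a toy sketch in plain Python. It does not use any real Kafka or StreamSets API: `Topic` and `run_pipeline` are hypothetical names invented for illustration, and retention is counted in records rather than time to keep the example small. The topic stores records and tracks an offset per consumer; the pipeline only passes records through its stages and keeps nothing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Topic:
    """Toy stand-in for a Kafka topic: an append-only log with retention."""
    retention: int  # max records kept (an assumption; real Kafka retains by time or size)
    records: List[str] = field(default_factory=list)
    base_offset: int = 0  # offset of records[0]
    committed: Dict[str, int] = field(default_factory=dict)  # next offset per consumer

    def produce(self, value: str) -> int:
        """Append a record, return its offset, and drop records past retention."""
        self.records.append(value)
        offset = self.base_offset + len(self.records) - 1
        while len(self.records) > self.retention:
            self.records.pop(0)
            self.base_offset += 1
        return offset

    def poll(self, consumer: str) -> List[str]:
        """Return the records this consumer has not read yet and advance its offset."""
        start = max(self.committed.get(consumer, self.base_offset), self.base_offset)
        batch = self.records[start - self.base_offset:]
        self.committed[consumer] = self.base_offset + len(self.records)
        return batch


Stage = Callable[[Iterable[str]], Iterable[str]]


def run_pipeline(origin: Iterable[str], stages: List[Stage], destination: List[str]) -> None:
    """Toy stand-in for a pipeline: read, transform step by step, write. Stores nothing."""
    data = origin
    for stage in stages:
        data = stage(data)
    destination.extend(data)


orders = Topic(retention=3)
for value in ["a", "b", "c", "d"]:
    orders.produce(value)  # "a" ages out once the fourth record arrives

sink: List[str] = []
run_pipeline(orders.poll("pipeline"), [lambda rs: (r.upper() for r in rs)], sink)

# The pipeline has moved the data on, but the topic still holds it,
# so a second, independent consumer can read the same records.
report_batch = orders.poll("report")
```

After this runs, `sink` holds the transformed records, `report_batch` holds the same three records untransformed, and a second `orders.poll("pipeline")` returns an empty list because that consumer's offset has already advanced.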

    3. SSIS is similar to StreamSets in that it is used to create pipelines/packages that consist of multiple tasks; each task can take the data/result from the previous tasks and do something with it. Both StreamSets and SSIS can connect to many kinds of data sources and destinations.

    My personal view on how StreamSets and SSIS are different is: