In an application we have to send sensory data stream from multiple clients to a central server over internet. One obvious solution is to use MOMs (Message Oriented Middlewares) such as Kafka, but I recently learned that we can do this with data base synchronization tools such as oracle Materialized View.
The later approach works in some application (sending data from a central server to multiple clients, inverse directin of our application), but what is the pros and cons of it in our application? Which one is better for sending sensory data stream from multiple (~100) clients to server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail consider an application in which many (about 100) clients have to send streaming data (1MB data per minute) to a central server over internet. The data are needed in server for the sake of online monitoring, analysis and some computation such as machine learning and data mining tasks.
My question is about the difference between db-to-db connection and streaming solutions such as kafka for trasfering data from clients to server.
I'm going to try and break your question down into in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write so I'd really appreciate it if you do two things off the back of it:
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients, at that amount we can assume that:
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database etc. This is description of the components that are going to be required and a rough guess as to how they will have to connect.
Materialized views are a persisted query. Therefore you have two choices:
As you can probably guess from the tradeoffs you'll have to make I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
Now we've decided that you have a single data pipeline your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from our raw data persistence. This means that the choices are now yours but they're also considerably easier to change.
Here we need to work out what technologies aren't going to change.
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database.
If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.
One final point about this; we've defined a data pipeline. A single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients, no need to do anything. Want to change the technology behind part of the system, as long as the interface remains the same there's no issue. Want to send data elsewhere, no problem, it's all in the raw data persistence layer.