amazon-web-servicesfailoveravailabilityamazon-kinesis

high availability for kinesis data stream consumer


I want to make below architecture for data sending.

producer --> Kinesis Data stream --> consumer

Consumer server can be shut down, therefore I think there should be at least 2 consumers. Is it right?

When there are two consumer for one data stream, is there any way to handle half data per consumer? As I know, there is no way. If each consumer consume same data, it is waste of time, cost. Because I just make 2 consumers for high availability. ( for fail-over )

In web-was architecture , ELB or L4 can help to send half data to each was server by load balancing.

I want to know similar way for kinesis data stream.


Solution

  • When there are two consumer for one data stream, is there any way to handle half data per consumer? As I know, there is no way.

    You are wrong.

    You should go through Kinesis Developer guide or more specifically https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-scaling.html.

    A kinesis stream is consists of 1 or more shards. Each shard can be processed independently.

    Quoting examples from above links,

    The following example illustrates how the KCL helps you handle scaling and resharding:

    For example, if your application is running on one EC2 instance, and is processing one Kinesis data stream that has four shards. This one instance has one KCL worker and four record processors (one record processor for every shard). These four record processors run in parallel within the same process.

    Next, if you scale the application to use another instance, you have two instances processing one stream that has four shards. When the KCL worker starts up on the second instance, it load-balances with the first instance, so that each instance now processes two shards.

    If you then decide to split the four shards into five shards. The KCL again coordinates the processing across instances: one instance processes three shards, and the other processes two shards. A similar coordination occurs when you merge shards.

    You just have to ensure that both the Kinesis Consumer applications (running on different machine) are configured with same application name. KCL tracks application names, shard checkpoints as Dynamo DB table. This dynamo db table is also used to define ownership for shards between consumer applications.

    So, if you have a Kinesis Stream with 4 shards and two consumer applications running on different machines. Then shard balancing will be done in following way.

    ----Shard1---> application-instance-1
    ----Shard2---> application-instance-1
    ----Shard3---> application-instance-2
    ----Shard4---> application-instance-2
    

    Suppose application-instance-1 goes down. Then application-instance-2 will start processing all shards.

    ----Shard1---> application-instance-2
    ----Shard2---> application-instance-2
    ----Shard3---> application-instance-2
    ----Shard4---> application-instance-2