Is it possible to create a KeyedStream
from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using reinterpretAsKeyedStream
or something similar)?
keyBy
), then is network shuffling at least minimized by doing a keyBy
on a the field that the source is sharded by (e.g. env.addSource(source).keyBy(pojo -> pojo.getTransactionId())
, where the source is a kinesis data stream that is sharded by transactionId
)What I've Learned so Far
reinterpretAsKeyedStream
, but this feature is experimental and seems to have significant drawbacks (as per discussions in the stackoverflow posts below)
reinterpretAsKeyedStream
that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data StreamContext of my Application
reinterpretAsKeyedStream
cannot be used)Any help/insight is much appreciated, thanks!
I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).