What is the fundamental difference between the "Fixed interval micro-batches" and "AvailableNow" triggers?
I find the documentation around those confusing.
Is the fundamental difference that AvailableNow shuts down when it finishes, while fixed interval micro-batches never shut down?
As far as I understand the documentation, AvailableNow does not mean one micro-batch containing everything available. Depending on the configured batch size, it may consume multiple micro-batches, up to what was available when the job was triggered. Am I understanding this correctly?
The other thing I find confusing in the documentation is that the micro-batch size is set by a property such as maxBytesPerTrigger (depending on the data source). If AvailableNow represents a single trigger, that is a problem. So does AvailableNow actually mean multiple triggers?
The AvailableNow trigger will process all available data in the source when the query starts. It can process all of that available data in multiple micro-batches using whatever stream configurations you have, such as maxBytesPerTrigger. Once it finishes processing all of that data, it will exit, and that streaming query will no longer be running on your cluster.
The Fixed interval micro-batch trigger will run a single micro-batch every interval that you specify. Each micro-batch will respect your stream configurations like maxBytesPerTrigger. Unlike AvailableNow, this trigger will not ever exit on its own. It will keep running until you manually stop it (via query.stop()) or it encounters an exception.
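To make the two behaviours above concrete, here is a toy simulation in plain Python. It is not Spark code: the function names, the list-based "source" and the max_per_batch cap (a stand-in for something like maxBytesPerTrigger, counted in records rather than bytes) are all made up for illustration.

```python
from typing import List


def available_now_batches(source: List[int], max_per_batch: int) -> List[List[int]]:
    """Toy model of AvailableNow: drain what exists at start, then stop."""
    end = len(source)  # snapshot of what is available when the query starts
    batches = []
    start = 0
    while start < end:
        stop = min(start + max_per_batch, end)  # each batch respects the cap
        batches.append(source[start:stop])
        start = stop
    return batches  # nothing left from the snapshot, so the "query" exits


def fixed_interval_batches(
    source: List[int], max_per_batch: int, arrivals_per_tick: List[List[int]]
) -> List[List[int]]:
    """Toy model of a fixed interval trigger: one capped batch per tick."""
    backlog = list(source)
    offset = 0
    batches = []
    # A real query would loop until stopped; the finite list of ticks
    # only exists so that this simulation terminates.
    for arrivals in arrivals_per_tick:
        backlog.extend(arrivals)  # new data may land between ticks
        stop = min(offset + max_per_batch, len(backlog))
        batches.append(backlog[offset:stop])  # empty list: nothing new this tick
        offset = stop
    return batches


print(available_now_batches([1, 2, 3, 4, 5], 2))
print(fixed_interval_batches([1, 2, 3], 2, [[], [4], []]))
```

In the first call, five records with a cap of two are processed as three micro-batches before the function returns, which is the "multiple micro-batches, then exit" behaviour. In the second, each tick yields at most one batch, and the loop only ends because the simulation runs out of ticks.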
AvailableNow is useful if you want to incrementally process your source on a one-off basis. Let's say you have some data in S3, and every now and then you want to process whatever new data has arrived. You can spin up a query with AvailableNow; it'll process all the data and exit. But if you want more real-time processing, you can use a fixed interval trigger.
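As a rough sketch of how the choice looks in PySpark, the two helpers below start a query with each trigger. The helper names, the parquet sink and the path arguments are assumptions for illustration; stream_df is assumed to be a streaming DataFrame you already built with spark.readStream, which is also where a source option such as maxBytesPerTrigger would be set. The availableNow keyword assumes a reasonably recent Spark version (3.3 or later).

```python
def start_available_now(stream_df, output_path, checkpoint_path):
    """Process everything available at start (possibly in several
    micro-batches), then let the query terminate on its own."""
    return (
        stream_df.writeStream
        .format("parquet")  # assumed sink, swap for your own
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .start(output_path)
    )


def start_fixed_interval(stream_df, output_path, checkpoint_path, interval="1 minute"):
    """Run one micro-batch per interval until the query is stopped or fails."""
    return (
        stream_df.writeStream
        .format("parquet")  # assumed sink, swap for your own
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime=interval)
        .start(output_path)
    )
```

With the AvailableNow variant, calling awaitTermination() on the returned query should return once the backlog is drained, so it fits a scheduled job. With the fixed interval variant, the query keeps running until you call stop() on it.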