I am trying to refresh the .tmp file with additional events every 5 minutes; my source is slow, and it takes 30 minutes to accumulate a 128 MB file in my HDFS sink.
Is there any property in the Flume HDFS sink that lets me control how often the .tmp file is refreshed before the file is rolled into HDFS?
I need this so I can query the data in the .tmp file through a Hive table.
Currently I am viewing the data from the .tmp file using a Hive table, but the .tmp file does not refresh for a long time because the roll size is 128 MB.
Consider decreasing your channel's capacity and transactionCapacity settings:
capacity (default: 100) — the maximum number of events stored in the channel
transactionCapacity (default: 100) — the maximum number of events the channel will take from a source or give to a sink per transaction
These settings control how many events are spooled before they are flushed to your sink. If you lower them to 10, for instance, every 10 events will be flushed to your .tmp file.
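As a sketch, the channel section of your agent configuration might look like this (the agent and channel names `agent1` / `ch1` are placeholders — substitute your own):

```properties
# Memory channel with small capacities so events reach the sink quickly.
# capacity limits how many events the channel buffers;
# transactionCapacity limits how many move per transaction.
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10
agent1.channels.ch1.transactionCapacity = 10
```

Note that transactionCapacity must not exceed capacity.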
The second value you will need to change is the batchSize in your HDFS sink:
hdfs.batchSize (default: 100) — the number of events written to file before it is flushed to HDFS
The default value of 100 will probably be too high if your source is very slow and you want to see events more often.
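A minimal sink configuration sketch, again with placeholder names (`agent1`, `hdfsSink`) and a placeholder HDFS path:

```properties
# HDFS sink that flushes to the open .tmp file after every 10 events,
# instead of the default 100.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfsSink.hdfs.batchSize = 10
```

Keep hdfs.batchSize no larger than the channel's transactionCapacity, since the sink takes at most one batch per transaction.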