Flume-ng HDFS sink .tmp file refresh rate control property


I am trying to have the .tmp file refreshed with new events every 5 minutes. My source is slow, and it takes about 30 minutes to accumulate a 128 MB file in my HDFS sink.

Is there a property in the Flume HDFS sink that controls how often the .tmp file is refreshed before the file is rolled in HDFS?

I need this so that I can query the data in the .tmp file through a Hive table.

Currently I am viewing the data from the .tmp file through a Hive table, but the .tmp file is not refreshed for a long time because the roll size is 128 MB.


Solution

  • Consider decreasing your channel's capacity and transactionCapacity settings:

    Property             Default  Description
    capacity             100      The maximum number of events stored in the channel
    transactionCapacity  100      The maximum number of events the channel will take from a source or give to a sink per transaction
    

    These settings control how many events are held in the channel and handed over per transaction before they are flushed to your sink. If you lower them to 10, for instance, events will be flushed to your .tmp file every 10 events.

    The second value you will need to change is the batchSize in your HDFS sink:

    Property        Default  Description
    hdfs.batchSize  100      Number of events written to file before it is flushed to HDFS
    

    The default value of 100 will probably be too high if you have a very slow source and you want to see events more often.
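
For illustration, here is a minimal sketch of how the three settings above might look together in an agent configuration file. The agent name `a1`, the channel name `c1`, the sink name `k1`, the memory channel type and the HDFS path are all assumptions made for this example, not taken from the question. The value 10 is just the example figure used above. The source definition and your existing roll settings are omitted.

```properties
# Hypothetical component names; replace a1, c1, k1 with the ones in your agent.
a1.channels = c1
a1.sinks = k1

# Channel: assumed here to be a memory channel.
a1.channels.c1.type = memory
# Maximum number of events stored in the channel (default 100).
a1.channels.c1.capacity = 10
# Maximum number of events per transaction (default 100).
# Keep this less than or equal to capacity.
a1.channels.c1.transactionCapacity = 10

# HDFS sink.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Placeholder path, assumed for this sketch.
a1.sinks.k1.hdfs.path = /flume/events
# Number of events written to file before it is flushed to HDFS (default 100).
# Keep this less than or equal to the channel's transactionCapacity.
a1.sinks.k1.hdfs.batchSize = 10
```

Comments are kept on their own lines because a `#` after a value in a properties file is read as part of the value. How small you can go depends on your event rate, so treat 10 as a starting point to tune rather than a recommended value.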