Is there a way to get the total number of files from a single run of the FetchHDFS processor?
My use case: read all files from a directory (HDFS), concatenate them, and then do further processing. To hold the merge processor until all files are in the queue, I need the file count so I can set "Minimum Number of Entries".
I can use Wait/Notify, but then I still need the total count to set the flags correctly.
In any case, doesn't it sound logical to have this as an attribute on FetchHDFS or any file-listing processor?
Update #2 (merge processor): As per the configuration, the merge processor should release files every 300 sec. In my use case there are 2000 input files in total, but they arrive at a slow pace (approx. 200 sec). So the configuration below should be good enough to merge all the files. But it is not working: I can still see the merge processor releasing files at much shorter intervals.
Update #3: The total size of all 1600 files is 318 KB, which is far less than the bin size of 128 MB.
ListHDFS/FetchHDFS doesn't provide the number of files picked up in a particular run. You can, however, use ExecuteScript or UpdateAttribute, together with Wait/Notify, to make it work.
The simplest solution I would suggest: MergeContent also takes an optional property called Max Bin Age. You can configure a time period here, like 2 mins or 30 secs, and set Minimum Number of Entries to some higher number. This way, even if the queue never reaches the number configured in Minimum Number of Entries, once the time configured for Max Bin Age elapses, the queued files will be picked up and merged together. This might require some assumptions and experimentation to get the configuration right, though.
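To make the interaction between these properties easier to reason about, here is a rough Python sketch of the release decision. It is a simplified model based on how the MergeContent properties are described, not NiFi's actual code; the `BinLimits` class, the `bin_is_ready` function and the value 5000 for the maximum number of entries are made up for illustration. The other numbers come from the question (2000 files, 300 sec, 318 KB, 128 MB).

```python
from dataclasses import dataclass


@dataclass
class BinLimits:
    """Hypothetical holder for MergeContent-style thresholds (not a NiFi class)."""
    min_entries: int      # Minimum Number of Entries
    max_entries: int      # Maximum Number of Entries
    min_bytes: int        # Minimum Group Size
    max_bytes: int        # Maximum Group Size
    max_bin_age_s: float  # Max Bin Age, in seconds


def bin_is_ready(entries: int, size_bytes: int, age_s: float, limits: BinLimits) -> bool:
    """Simplified model: decide whether a bin would be merged and released."""
    # A bin that has hit either maximum cannot accept more files.
    full = entries >= limits.max_entries or size_bytes >= limits.max_bytes
    # Max Bin Age forces the bin out no matter how few files it holds.
    expired = age_s >= limits.max_bin_age_s
    # Both minimums must be satisfied for an early release.
    full_enough = entries >= limits.min_entries and size_bytes >= limits.min_bytes
    return full or expired or full_enough


# Settings in the spirit of the answer: a high minimum plus a 300 sec age limit.
# max_entries=5000 is an assumed value chosen to sit above the expected count.
limits = BinLimits(
    min_entries=2000,
    max_entries=5000,
    min_bytes=0,
    max_bytes=128 * 1024 * 1024,
    max_bin_age_s=300,
)

# 1600 files totalling 318 KB, bin is 120 sec old.
print(bin_is_ready(1600, 318 * 1024, 120, limits))
# Same bin once it reaches the 300 sec age limit.
print(bin_is_ready(1600, 318 * 1024, 300, limits))
```

Under this model, a low Minimum Number of Entries lets a bin go almost immediately, which would match the short intervals described in the question. It also suggests that the maximum number of entries should not be lower than the expected file count, otherwise the bin fills up and is released before all files arrive.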