apache-nifi kylo

get total number of files from FetchHDFS processor


Is there a way to get the total number of files from a single run of the FetchHDFS processor?

My use case: read all files from an HDFS directory, concatenate them, and then do further processing. To hold the merge processor back until all files are in the queue, I need the file count so that I can set "Minimum Number of Entries".

I can use Wait/Notify, but then I still need the total count to set the flags correctly.

In any case, doesn't it sound logical to have this as an attribute on FetchHDFS or any file-listing processor?

Update #2 (merge processor): As per the configuration, the merge processor should release files every 300 sec. In my use case there are 2000 input files in total, but they arrive at a slow pace (approx. 200 sec). So the configuration below should be good enough to merge all the files, but it is not working: I can still see the merge processor releasing files at much smaller intervals. [Screenshot: merge processor configuration]

Update #3: The total size of all 1600 files is 318 KB, which is far less than the bin size of 128 MB.



Solution

  • ListHDFS/FetchHDFS doesn't expose the number of files picked up in a particular run. You can, however, make it work with ExecuteScript or UpdateAttribute together with Wait/Notify.

    The simplest solution I would suggest: MergeContent also has an optional property called Max Bin Age. Set it to a time period such as 2 mins or 30 secs, and set Minimum Number of Entries to some higher number. This way, even if the queue never reaches the configured Minimum Number of Entries, the queued files are picked up and merged together once the Max Bin Age elapses. Getting the configuration right may require some assumptions and experimentation, though.
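To see why the settings interact this way, here is a rough Python model of when a bin gets released. It is only a sketch of the rule as I understand it, not NiFi's actual code: `BinConfig` and `should_release` are made-up names, the field names loosely mirror the MergeContent properties, and the 128 MB maximum is taken from the question rather than from any default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BinConfig:
    """Hypothetical stand-in for the MergeContent bin settings."""
    min_entries: int = 1                      # Minimum Number of Entries
    max_entries: int = 1000                   # Maximum Number of Entries
    min_size_bytes: int = 0                   # Minimum Group Size
    max_size_bytes: int = 128 * 1024 * 1024   # bin size from the question
    max_bin_age_sec: Optional[float] = None   # Max Bin Age; None = no limit


def should_release(cfg: BinConfig, entries: int, size_bytes: int, age_sec: float) -> bool:
    """Return True if a bin would be merged under this simplified model."""
    # A bin that hits either maximum is released immediately.
    is_full = entries >= cfg.max_entries or size_bytes >= cfg.max_size_bytes
    # A bin that satisfies BOTH minimums is also eligible for release.
    meets_minimums = entries >= cfg.min_entries and size_bytes >= cfg.min_size_bytes
    # Max Bin Age forces the bin out regardless of the minimums.
    too_old = cfg.max_bin_age_sec is not None and age_sec >= cfg.max_bin_age_sec
    return entries > 0 and (is_full or meets_minimums or too_old)


# With min_entries left at 1, a handful of files is already "enough",
# so the bin is released long before the 300 sec age limit.
loose = BinConfig(max_bin_age_sec=300)
print(should_release(loose, entries=5, size_bytes=1_000, age_sec=10))

# Raising the minimum (and maximum) makes Max Bin Age the deciding factor.
tuned = BinConfig(min_entries=5000, max_entries=10000, max_bin_age_sec=300)
print(should_release(tuned, entries=1600, size_bytes=318 * 1024, age_sec=120))
print(should_release(tuned, entries=1600, size_bytes=318 * 1024, age_sec=300))
```

Under this model, the behaviour in Update #2 is what you would expect if Minimum Number of Entries is still at a low value: the age limit is only an upper bound on waiting, not a fixed release interval. Note also that Maximum Number of Entries probably has to be raised above the expected file count, otherwise a bin fills up and is released before all files have arrived.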