I need to develop an application that processes CSV files as soon as they are created in a predefined directory. A huge number of incoming files is expected.
I have seen applications using Apache Commons IO File Monitoring in production. It works pretty well; I have seen it process as many as 21 million files in a day. It seems that Apache Commons IO File Monitoring polls the directory and calls listFiles to find the files to process.
My question: Is JDK WatchService as good an option as Apache Commons IO File Monitoring? Does anyone know of any pros and cons?
Since I asked this question, I have gained some more insight into the matter, so I am answering it here for anyone who might have a similar question.
Apache Commons IO File Monitoring uses a polling mechanism with a configurable polling interval. On every poll, it calls the listFiles() method of the File class and compares the result with the listFiles() output of the previous iteration to identify file creation, modification and deletion. The algorithm is robust, and I have never seen it miss a file. It works well even with a large volume of files. However, since it polls and invokes listFiles on every iteration, it consumes CPU cycles unnecessarily when few files arrive. It also works on network drives.
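To illustrate the snapshot-and-compare idea described above, here is a minimal sketch that uses only the JDK. It is not the Commons IO implementation: the class PollingSketch and its methods snapshot, created, deleted and modified are names made up for this illustration, and treating a changed last-modified timestamp as a modification is my assumption.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of polling by comparing directory snapshots.
class PollingSketch {

    // Snapshot of a directory: file name -> last-modified timestamp.
    static Map<String, Long> snapshot(File dir) {
        Map<String, Long> result = new HashMap<>();
        File[] files = dir.listFiles(); // null if dir is not a readable directory
        if (files != null) {
            for (File f : files) {
                result.put(f.getName(), f.lastModified());
            }
        }
        return result;
    }

    // Names present in the current snapshot but not in the previous one.
    static Set<String> created(Map<String, Long> previous, Map<String, Long> current) {
        Set<String> names = new TreeSet<>(current.keySet());
        names.removeAll(previous.keySet());
        return names;
    }

    // Names present in the previous snapshot but not in the current one.
    static Set<String> deleted(Map<String, Long> previous, Map<String, Long> current) {
        return created(current, previous);
    }

    // Names present in both snapshots whose timestamp differs (assumed heuristic).
    static Set<String> modified(Map<String, Long> previous, Map<String, Long> current) {
        Set<String> names = new TreeSet<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long old = previous.get(e.getKey());
            if (old != null && !old.equals(e.getValue())) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        File dir = Files.createTempDirectory("incoming").toFile();
        Map<String, Long> previous = snapshot(dir);
        new File(dir, "a.csv").createNewFile();
        Map<String, Long> current = snapshot(dir); // one "poll" later
        System.out.println("created: " + created(previous, current));
    }
}
```

In a real monitor the two snapshot calls would be separated by the polling interval, and the current snapshot would become the previous one for the next iteration. Every iteration lists the whole directory, which is where the constant CPU cost comes from.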
JDK WatchService is event based: where the platform offers native file change notification, it does not need to poll. It is triggered only when an event occurs and hence uses less CPU when the file inflow is low. If the inflow is heavy and events are processed at a slower rate than they occur, events can overflow, in which case the service reports an OVERFLOW event and individual events may be lost. Additionally, it is not reliable on network drives, because whether changes to files on non-local storage are detected is implementation specific.
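For comparison, here is a minimal WatchService sketch that also shows one way to cope with the overflow case mentioned above. The class WatchSketch and the helper nextBatch are hypothetical names, and falling back to a full directory listing on OVERFLOW is only one possible recovery strategy, chosen here as an assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of event-based monitoring with overflow handling.
class WatchSketch {

    // Waits up to timeoutSeconds for one batch of events and returns the files to process.
    static List<Path> nextBatch(WatchService watcher, Path dir, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<Path> toProcess = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) {
            return toProcess; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                // Events were lost or discarded: fall back to listing the directory.
                // This may return files that were already handled earlier.
                toProcess.clear();
                try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
                    for (Path p : stream) {
                        toProcess.add(p);
                    }
                }
                break;
            }
            // For ENTRY_CREATE the context is the path relative to the watched directory.
            toProcess.add(dir.resolve((Path) event.context()));
        }
        key.reset(); // re-arm the key so that further events are queued
        return toProcess;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("incoming");
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            Files.createFile(dir.resolve("a.csv"));
            System.out.println(nextBatch(watcher, dir, 30));
        }
    }
}
```

A real application would call nextBatch in a loop and hand the returned paths to worker threads, so that the watching thread drains events quickly and overflow becomes less likely.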
In conclusion, if the file inflow is continuous and huge, it is better to go for Apache Commons IO File Monitoring. Otherwise, JDK WatchService is a good option.