javajakarta-eeglassfishbatch-processingfile-monitoring

Watch Service and Java EE Batch Processing


Context

I'm in the process of drawing a solution to migrate a huge PL/SQL system to Java. The initial step is migrating some ETL jobs that:

  1. Reads CSV, XML, (XLS, which is a new requirement) and Positional files from several ftp / sftp sources
  2. Process the files according to rules stored in the database and write the results to a database table.

Currently this is done by several store procedures and Jobs.

My company is open to suggestions (if it can run in GlassFish 4 and share its logging and connection pool mechanisms, as well as the admin console, it is a plus).

I've done a little bit of research and the following options caught my eye:

  1. Java EE 7 Batch Processing, sounds simple and particularly well fitted for GlassFish 4.
  2. Spring Batch somewhat more mature and very similar to the Java EE 7 standard (which was probably based on it).
  3. Apache Camel, sounds powerful and would spare us from a lot of fiddling with libraries such a Apache POI, but it also looks somewhat complex. Also I'm not sure if it is the best fit for the job (ETL over huge files).
  4. Cook everything by myself. I could create a Application Client to run a Quartz / Spring Scheduler or even EJB Timers

While I'm still open to suggestions (recommendations would be nice), the best fit so far seems to be Java EE 7 Batch Processing.

One more thing, the infrastructure team have a solution to move files from every ftp source to a local directory, so FTP is really not an issue.

Problem

I've read several tutorials about Java EE Batch Processing and, in all of them, some kind of Servlet or EJB Timer is responsible for starting the Jobs:

JobOperator jobOperator = BatchRuntime.getJobOperator(); 
jobOperator.start("job", properties);

I could easily upload a web / ejb project and keep pooling for changes. But I was thinking about a push model:

  1. Application client console application
  2. Main class watches directories for new files
  3. When there is a new file it would start a new job.

My doubts are:

  1. Is this strategy possible/ advisable?
  2. Will I need a JMS queue or some kind of producer / consumer strategy in the middle or should I just call jobOperator.start for every file and trust the batch processing layer to manage the application resources? In other words, if a thousand files are delivery to my folder at once and I call jobOperator.start a thousand times, will GlassFish 4 do some kind of smart enqueuing or should I create some kind of Gate so that no more than n jobs run simultaneously?

Solution

  • I've already implemented a project with Batch Processing in Wildfly (Jboss AS). I'm not familiar with configuration details on Glassfish (not using it anymore because the've dropped enterprise support), however I can give you some insights and guidelines according to my experience. Also, please note that Spring and the Batch spec. on EE 7 are quite similar, and your decision to use either technology must depend on "what else" you want to achieve with your application besides the batching. Do you want an easily maintained web interface? Do you want to depelop a REST api?, etc.

    The ETL jobs you're describing fit pefeclty with the steps and chunks model in the EE 7 spec, so If you've already tried to develop some tests, you may have noticed that you still need to code the file readers and mappers for each file specification. Your reading sources are quite standard, and you will easily find a library to read/stream them and process their data.

    The project I've implemented is quite simple. Customers uplodad files that need to be processed in order to feed a data warehouse. This service is on the "cloud". Files have a defined spec and must be in CSV format. Most processing results are dimentional "Upserts" and fact "erasing prior inserting". The user has a Web interface on which files and batch processing metadata must be shown (processing state, dates, rejected items, etc.). Because it is a cloud service, the files must not reside locally on each server (using S3).

    So the first thing to design are the chunk steps. I didn't want to have an implementation for each file spec., So what I did is to design a "fit all cases" implemetation that process files according to the metadata contained in them and also the job configuration itself. This is the easy part. The second thing to think about is the processing and metadata administration. Here, I developed a REST api and a Web interface that uses it. After all this, Will it scale? Wilfly has thread configuration parameters for the Batch Processing, and you can increase or decrease the thread availability for the JobOperator. Jobs are not submitted if there are not enough threads available. So what happends to those requests? Well, they can reside on memory, a backed up stateful session can be developed, you can definitely implement MQ listener of queued processing requests. What I did was much simpler. The company doesn't have the resources to maintain a cluster, so whe did an elastic configuration that will expand accoding to cpu consumption and requests volume. So far, the application has processed 10 TB of data, from 15 customers, and at max request/processing peak, 3 elastic instances have fired up.

    A file listener is an interesting idea. You can listen to a directory and drop a processing request to a queue or inmediately to the BatchRuntime. It will depend on how you want to scale it, your needed response time, the available resources, etc.

    Feel free to ask me anything.

    Regards.

    EDIT: forgot to mention. I don't really recommend using the Application client unless you've already got something deployed on your organization. The recent security constraints and java SE updates mechanism has made a real hassle to maintain those kind of deployments. Think web.