In the current project we need to run some quite complicated calculations on data exported from our system. The calculations are handled by third-party software (which is essentially a black box for us). We have this software as Linux and Windows binaries, and we know how to execute it with our data from the command line.
Processing a single dataset on one CPU core takes around 200 hours. However, we can split the dataset into smaller, structurally equivalent datasets and run the calculations in parallel. Later on, we can easily aggregate the results. Our goal is to process each dataset in under 10 hours.
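The sizing arithmetic behind the 10-hour goal can be sketched in plain Java: 200 CPU-hours divided across N chunks must come in under the deadline, so you need at least ceil(200 / 10) = 20 chunks (ignoring splitting/aggregation overhead; the class and method names below are made up for illustration):

```java
public class ChunkSizing {
    /** Minimum number of equal chunks so that totalCpuHours / chunks <= deadlineHours. */
    static int minChunks(double totalCpuHours, double deadlineHours) {
        return (int) Math.ceil(totalCpuHours / deadlineHours);
    }

    public static void main(String[] args) {
        System.out.println(minChunks(200, 10));  // prints 20
        System.out.println(minChunks(200, 0.5)); // the 30-minute case: prints 400
    }
}
```

In practice you would add headroom for per-chunk startup cost and for the slowest node, so the real chunk count would be somewhat higher.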
Our customer has a proprietary job processing application. Its interface is file-system based: we copy the job's EXE file (yep, it's Windows-based) and its configuration INI file into an incoming folder, the job processing app executes the job on one of its nodes (handling errors, failover, etc.), and finally copies the results to an outgoing folder. This proprietary system has several hundred CPU cores, so there is clearly enough power to handle our dataset in under 10 hours. Even in under 30 minutes.
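The submission side of that file-system interface is just two file copies into the incoming folder. A minimal stdlib-only sketch (class name, directory layout, and file names are assumptions for illustration, not the actual system's conventions):

```java
import java.io.IOException;
import java.nio.file.*;

public class JobSubmitter {
    private final Path incoming;

    JobSubmitter(Path incoming) {
        this.incoming = incoming;
    }

    /** Copy the job binary and its INI configuration into the incoming folder. */
    void submit(Path exe, Path ini) throws IOException {
        Files.copy(exe, incoming.resolve(exe.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        Files.copy(ini, incoming.resolve(ini.getFileName()), StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Simulate the setup with temp directories and stub files.
        Path tmp = Files.createTempDirectory("jobs");
        Path in = Files.createDirectory(tmp.resolve("incoming"));
        Path exe = Files.writeString(tmp.resolve("calc.exe"), "binary stub");
        Path ini = Files.writeString(tmp.resolve("calc.ini"), "threads=1");

        new JobSubmitter(in).submit(exe, ini);
        System.out.println(Files.exists(in.resolve("calc.exe"))
                && Files.exists(in.resolve("calc.ini"))); // prints true
    }
}
```

A real submitter would likely also need atomic moves (write to a temp name, then rename) so the job processor never picks up half-copied files.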
Now, the thing is, our application is so far a more-or-less standard JBoss-based J2EE app. And we need to:
To me, many parts of what we have to do look very similar to Enterprise Application Integration patterns like Splitter and Aggregator. So I was wondering whether Apache Camel would be a good fit for the implementation:
However, I have no experience with Apache Camel yet, so I've decided to ask for advice on its applicability.
Given the problem described above, do you think Apache Camel would be a good match for the task?
Closing note: I'm not looking for external resources or a tool/library suggestion. Just a confirmation (or the opposite) that I'm on the right track with Apache Camel.
You have quite a complicated use case there. Let me rephrase what you would like to do in a simple format and share my thoughts. If you see I misunderstood something, just leave a comment and I will revise my post.
A JBoss-based J2EE application has a large dataset that needs to be split into smaller pieces and transformed into a custom format. This format is then written out to disk and processed by another application, which creates new result data in an output folder on disk. You then want to pick up this output and aggregate the results.
I would say that Apache Camel can do this, but you will have to take the time to properly tune the system to your needs and set up a few custom configurations on your components. I imagine the process looking something like:
from("my initial data source")
    .split().method(CustomBean.class, "customSplitMethod")
    // You might want some sort of round-robin pattern to
    // distribute between the different directories
    .to("file://customProgramInputDirectory");

from("file://customProgramOutputDirectory")
    .aggregate(constant(true), new MyCustomAggregationStrategy())
    // The aggregation needs a completion condition, for example:
    .completionSize(expectedNumberOfChunks)
    .to("output of your data source");
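The round-robin distribution mentioned in the comment above can be as simple as assigning chunk i to input directory i % N. A plain-Java sketch of that rotation, outside of Camel (the directory names are made up; in a Camel route you would plug this into a dynamic endpoint instead):

```java
import java.util.List;

public class RoundRobinRouter {
    private final List<String> inputDirs;
    private int next = 0;

    RoundRobinRouter(List<String> inputDirs) {
        this.inputDirs = inputDirs;
    }

    /** Return the next input directory in rotation. */
    synchronized String nextDir() {
        String dir = inputDirs.get(next);
        next = (next + 1) % inputDirs.size();
        return dir;
    }

    public static void main(String[] args) {
        RoundRobinRouter router = new RoundRobinRouter(List.of("in0", "in1", "in2"));
        for (int i = 0; i < 5; i++) {
            System.out.print(router.nextDir() + " ");
        }
        // prints: in0 in1 in2 in0 in1
    }
}
```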
Since you said you will be integrating with a "proprietary queue-like job processing system", I might have misunderstood the input and output of the other program to be file directories. If it is a queue-based system and it supports JMS, there is a generic JMS component you can use; if not, it is always possible to create a custom Camel component, so your pattern would just change from 'file://' to 'MyCustomEndpoint://'.
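Whichever way you wrap it as a Camel endpoint, the core of the consuming side of such a file-based system is just polling the outgoing folder. A stdlib-only sketch of that core, as an assumption of what a custom component would wrap (sorting by file name here is only for deterministic ordering):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OutputFolderPoller {
    /** Return the result files currently present in the outgoing folder, in name order. */
    static List<Path> poll(Path outgoing) throws IOException {
        try (Stream<Path> files = Files.list(outgoing)) {
            return files.filter(Files::isRegularFile)
                        .sorted(Comparator.comparing(p -> p.getFileName().toString()))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate an outgoing folder with two result files.
        Path out = Files.createTempDirectory("outgoing");
        Files.writeString(out.resolve("result-1.dat"), "part one");
        Files.writeString(out.resolve("result-2.dat"), "part two");
        System.out.println(poll(out).size()); // prints 2
    }
}
```

A production version would also have to decide when a result file is complete (e.g. a done-marker file or a rename convention), which is exactly the kind of detail a custom Camel component lets you encapsulate behind the endpoint URI.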