java multithreading executorservice long-running-processes

Improve performance for reading a file line by line and processing it


I have a piece of Java code that does the following:

  1. Opens a file with data in the format {A,B,C}; each file has approximately 5,000,000 lines.
  2. For each line in the file, calls a service that returns a column D and appends it to {A,B,C}, giving {A,B,C,D}.
  3. Writes this entry to a chunked writer, which groups 10,000 lines together and writes each chunk to a remote location.
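
Step 3 becomes relevant once several threads write at the same time, so here is a minimal sketch of a chunked writer that is safe to call concurrently. The class name `ChunkedWriterSketch` and the `sink` callback (standing in for the upload to the remote location) are hypothetical; the real chunked writer is not shown in this post and may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedWriterSketch {
    private final int chunkSize;
    // Hypothetical stand-in for "write the chunk to a remote location".
    private final Consumer<List<String>> sink;
    private List<String> buffer = new ArrayList<>();

    ChunkedWriterSketch(int chunkSize, Consumer<List<String>> sink) {
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    // synchronized so that several worker threads can share one writer.
    synchronized void write(String line) {
        buffer.add(line);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Hands the buffered lines to the sink; call once more at the end
    // so that a final, partially filled chunk is not lost.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> chunk = buffer;
        buffer = new ArrayList<>();
        sink.accept(chunk);
    }

    public static void main(String[] args) {
        ChunkedWriterSketch writer = new ChunkedWriterSketch(
                3, chunk -> System.out.println("chunk of " + chunk.size()));
        for (int i = 0; i < 7; i++) {
            writer.write("A,B,C,D" + i);
        }
        writer.flush(); // emits the last, smaller chunk
    }
}
```

In the real job the chunk size would be 10,000 rather than 3, and the sink would do the remote write.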

Right now the code takes 32 hours to execute. The same process then has to be repeated for another file, which would presumably take another 32 hours, but we need these processes to run daily.

Step 2 is further complicated by the fact that the service sometimes does not have D yet. It is designed to fetch D from its super data store, so it throws a transient exception asking the caller to wait. We have retries to handle this, so an entry could technically be retried 5 times with a max delay of 60000 millis. In the worst case we could be looking at 5,000,000 * 5 service calls.
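
To make the retry behaviour concrete, here is a minimal sketch of a bounded retry loop. `TransientException`, `callWithRetry`, and the doubling backoff are assumptions made for illustration; the real service client and its retry policy are not shown here, only the limit of 5 tries and the 60000 ms cap come from the description above.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Hypothetical stand-in for the service's transient "please wait" exception. */
    static class TransientException extends Exception {
        TransientException(String message) {
            super(message);
        }
    }

    /**
     * Calls the lookup up to maxAttempts times, sleeping between attempts.
     * The delay doubles after each failure and is capped at maxDelayMillis.
     */
    static <T> T callWithRetry(Callable<T> lookup, int maxAttempts,
                               long initialDelayMillis, long maxDelayMillis)
            throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return lookup.call();
            } catch (TransientException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the failure
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fake service: not ready on the first two calls, then returns D.
        String d = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientException("D not ready yet");
            }
            return "D";
        }, 5, 1, 60_000);
        System.out.println(d + " after " + calls[0] + " calls");
    }
}
```

Note that a worker thread sleeping in such a loop is occupied for the whole wait, which is one reason the per-line latency dominates the total run time.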

Each combination of {A,B,C} is unique, so the result D can't be cached and reused; a fresh request has to be made to get D every time.

I've tried adding threads like this:

temporaryFile = File.createTempFile(key, ".tmp");
Files.copy(stream, temporaryFile.toPath(),
       StandardCopyOption.REPLACE_EXISTING);
reader = new BufferedReader(new InputStreamReader(
       new FileInputStream(temporaryFile), StandardCharsets.UTF_8));
String entry;
while ((entry = reader.readLine()) != null) {
   // Lambdas can only capture effectively final variables
   final String finalEntry = entry;
   service.execute(() -> {
       try {
           processEntry(finalEntry);
       } catch (Exception e) {
           log.error("something");
       }
   });
   count++;
}

Here the processEntry method abstracts the implementation details explained above, and the thread pool is defined as:

ExecutorService service = Executors.newFixedThreadPool(10);

The problem I'm having is that the first set of threads spins up, but the process doesn't wait until all threads have finished their work and all 5,000,000 lines are complete. So the task that used to take 32 hours to complete now ends in under a minute, which messes up our system's state. Are there alternative ways to do this? How can I make the process wait for all threads to complete?


Solution
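
One common way to make the submitting thread wait is to call `shutdown()` on the executor once every line has been submitted and then block in `awaitTermination()`. The sketch below shows that pattern under some assumptions: `WaitForPoolSketch`, `processAll`, and the `processEntry` callback are hypothetical names, and the bounded queue with `CallerRunsPolicy` is an addition that is not in the question, included so that 5,000,000 pending tasks are not all held in memory at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class WaitForPoolSketch {

    /** Submits every line to a 10-thread pool and returns only when all are done. */
    static long processAll(Iterable<String> lines, Consumer<String> processEntry)
            throws InterruptedException {
        // Assumption: a bounded queue. When it is full, CallerRunsPolicy makes
        // the reading thread run the task itself, which throttles the reader.
        ThreadPoolExecutor service = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1_000),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicLong completed = new AtomicLong();

        for (String entry : lines) {
            service.execute(() -> {
                try {
                    processEntry.accept(entry);
                    completed.incrementAndGet();
                } catch (Exception e) {
                    System.err.println("failed to process entry: " + e);
                }
            });
        }

        // Stop accepting new tasks; tasks already submitted still run.
        service.shutdown();
        // Block until the pool has drained, reporting progress while waiting.
        while (!service.awaitTermination(1, TimeUnit.MINUTES)) {
            System.out.println("still working, completed so far: " + completed.get());
        }
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            lines.add("A" + i + ",B,C");
        }
        long done = processAll(lines, entry -> {
            // Placeholder for: call the service, append D, hand to the chunked writer.
        });
        System.out.println("completed " + done + " of " + lines.size());
    }
}
```

The same `shutdown()` followed by `awaitTermination()` pair also works with the `Executors.newFixedThreadPool(10)` pool from the question; that pool uses an unbounded queue, so every submitted line would sit in memory until a thread picks it up.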