
Scaling a Spring Batch application with 30 batch jobs


I’m working with a Spring Batch app that has around 30 jobs. Some jobs are dependent on others (e.g., Job B only runs after Job A completes), and all jobs run sequentially for ~500 accounts. To optimize, we’ve set this up as a StatefulSet and assigned accounts to specific pods (although the distribution isn’t perfect).

For example: Pod0 -> 50 accounts

Pod1 -> 50 accounts ...

Each Job has to be executed for all the accounts. That means, JobA has to be executed for all the 50 accounts on Pod0 and similarly in each pod.

    // For every account assigned to this pod, run each job in order.
    accountShardService.getAccountIds().forEach(account ->
            jobs.stream()
                .forEach(job -> runJob(account, job)));

Challenges:

Some accounts have far more data than others, and their jobs slow everything else down. Long-running jobs consume most of the resources and delay subsequently scheduled jobs. For example, Pod0 has to execute Job-A for 50 accounts. If one of those accounts has a huge amount of data to process, it takes up most of the time and delays execution for the remaining 49 accounts. What would be the best way to optimise this?

Also, we are thinking of going stateless so that any pod can pick up any job, which would improve flexibility. But I’m unsure how to set up HPA effectively, especially which metrics to use to scale up or down based on job load. A pod's CPU and memory usage will not be high just because one account's job takes longer to process, yet that job still delays the jobs for other accounts.

I’d love any advice on:

  • Good metrics for HPA in this setup
  • Ways to dynamically assign accounts across pods without impacting job dependencies

Note: We are using an external Postgres database for the job repository metadata.


Solution

  • JobA has to be executed for all the 50 accounts on Pod0

    When you run batch jobs at scale on Kubernetes, you should not assume or enforce that a job runs on a specific pod. Let Kubernetes choose where to run each job.

    Dependencies between jobs are what compromise scalability. You won't benefit from a truly scalable batch architecture until you remove those dependencies (if you can't, you need a way to encapsulate the dependent logic in a single unit of work).

    Each Job has to be executed for all the accounts

    In this case, I would first create a job of jobs (i.e., a composite job built with the JobStep concept in Spring Batch) to encapsulate the serial execution logic. Then, loop over the accounts and submit (to k8s) one composite job instance for each account. This way, jobs for different accounts can run in parallel, while jobs for a single account are executed serially.
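
To make that scheduling shape concrete, here is a minimal plain-Java sketch. It deliberately uses neither Spring Batch nor Kubernetes: the `PerAccountPipeline` class and the `JobRunner` interface are hypothetical names invented for illustration, `JobRunner` stands in for "launch one job for one account", and a fixed thread pool stands in for the cluster. Each account gets its own pipeline that runs the jobs in order, so a slow or failed account only holds up its own chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PerAccountPipeline {

    /** Hypothetical stand-in for launching one batch job for one account. */
    public interface JobRunner {
        void run(String accountId, String jobName) throws Exception;
    }

    /**
     * Runs the ordered job list once per account. Accounts run in parallel on
     * the pool; jobs for a single account run strictly in the given order.
     * Returns, per account, the jobs that completed before any failure.
     */
    public static Map<String, List<String>> runAll(List<String> accountIds,
                                                   List<String> orderedJobs,
                                                   JobRunner runner,
                                                   int parallelism) {
        Map<String, List<String>> completed = new ConcurrentHashMap<>();
        List<Callable<Void>> pipelines = new ArrayList<>();
        for (String accountId : accountIds) {
            // One pipeline per account: the "job of jobs" for that account.
            pipelines.add(() -> {
                List<String> done = new ArrayList<>();
                for (String job : orderedJobs) {
                    try {
                        runner.run(accountId, job);
                    } catch (Exception e) {
                        break; // a failed job blocks its dependents for this account only
                    }
                    done.add(job);
                }
                completed.put(accountId, List.copyOf(done));
                return null;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            pool.invokeAll(pipelines); // blocks until every account pipeline has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for pipelines", e);
        } finally {
            pool.shutdown();
        }
        return completed;
    }
}
```

A call such as `PerAccountPipeline.runAll(accountIds, List.of("JobA", "JobB"), runner, 4)` would process up to four accounts at a time while still running JobB only after JobA for each account. In the setup described in the answer, each pipeline would instead be a composite Spring Batch job submitted to Kubernetes as its own unit of work.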