ruby-on-rails ruby concurrency bigdata sidekiq

Ruby + Sidekiq: best way to execute and handle big data


Imagine we have 10k entities-x. For each entity-x we have to make an async API call. Each API call returns 100 entities-y, so in total we have 10k * 100 = 1_000_000 entities-y. For each entity-y we have to make another async API call and get the result. What is the best way to do this?

For context: my CPU has 16 threads.

My first thought was to split the 10k entities-x between threads so that each thread handles its own share of entities-x and entities-y. For example, we could divide the 10k entities-x by 16 (the number of threads) and give each thread one part. But then I understood that Ruby can run only one thread at a time, although Sidekiq runs jobs concurrently on separate threads.

My next thought was to split the entities-x between Sidekiq jobs: divide the 10k entities-x by 16 (the number of threads) and give each job one part. But I'm not sure about that. In theory we can use more jobs than threads, and I don't know whether that would be more efficient. What do you think?


Solution

  • I would create two Sidekiq jobs: one that fetches an entity-x and enqueues another job for each entity-y returned, and a second one that fetches an entity-y and processes it.

    Basically like this (pseudocode):

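    class EntityXJob
      include Sidekiq::Job
    
      # Sidekiq serializes job arguments as JSON, so pass a simple
      # identifier (string or integer) rather than a full object.
      def perform(entity_x_id)
        response = fetch(entity_x_id) # placeholder for the first API call
    
        # Fan out: one job per entity-y returned for this entity-x
        # (assuming the response lists simple ids).
        response.entities_y.each do |entity_y_id|
          EntityYJob.perform_async(entity_y_id)
        end
      end
    end
    
    class EntityYJob
      include Sidekiq::Job
    
      def perform(entity_y_id)
        response = fetch(entity_y_id) # placeholder for the second API call
        process(response)             # placeholder for your own processing
      end
    end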
    

    To start processing everything, you need to enqueue one EntityXJob for each entity_x. How to do that depends on where you can get a list of all entities_x from (another API request, records already stored in your DB, or a config file) and how you want to trigger processing (an action in a controller, another background job, cron). If you had those IDs in the DB, then you could, for example, trigger enqueueing all jobs with Rails Runner like this:

    rails runner "EntityX.find_each { |entity_x| EntityXJob.perform_async(entity_x.id) }"
    

    Processing those jobs one by one in Sidekiq allows you to monitor progress in the Sidekiq Web UI, and failed jobs are automatically retried by Sidekiq in its default configuration.

    Which number of workers works best for you depends on how long the jobs have to wait for IO from the API requests and how complex the processing of each job is. On a machine with 16 cores, I would configure at least 16 Sidekiq processes, probably more, because most workers will likely be idle and waiting for API responses most of the time. The limiting factor in your example will likely be RAM, not CPU cores.
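
    To get a feeling for how IO wait and thread count interact, here is a rough back-of-the-envelope sketch. The 300 ms average latency and the 16 processes with 10 threads each are assumptions picked for illustration, not measurements, and `estimated_hours` is a hypothetical helper. It also ignores processing time, retries, and rate limits.

    ```ruby
    # Total API calls: one per entity-x plus one per entity-y.
    TOTAL_CALLS = 10_000 + 10_000 * 100 # 1_010_000

    # Hypothetical helper: wall-clock hours if every thread is always busy
    # waiting on exactly one API call at a time.
    def estimated_hours(total_calls, avg_latency_seconds, concurrent_threads)
      total_calls * avg_latency_seconds / concurrent_threads / 3600.0
    end

    # Assumed numbers: 0.3 s per call, 16 processes * 10 threads each.
    many_threads = estimated_hours(TOTAL_CALLS, 0.3, 16 * 10)
    # Same latency, but only 16 threads in total.
    few_threads = estimated_hours(TOTAL_CALLS, 0.3, 16)

    puts format("160 threads: %.2f h, 16 threads: %.2f h", many_threads, few_threads)
    ```

    Under these assumptions the run takes roughly half an hour with 160 threads and over five hours with 16, which is why the thread count matters far more than the core count for IO-bound jobs.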

    Also keep in mind that the API might have a rate limit. In that case, a different approach might be required to ensure that the workers do not max out the API request rate limit.
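
    One simple way to soften the load on the API is to stagger the jobs when enqueueing them. The following is only a sketch: `stagger_delay` is a hypothetical helper and the limit of 4 jobs per second is an assumed number. It spreads out when jobs become due, but it does not strictly enforce a rate limit, because retries and the fan-out into EntityYJob still add requests.

    ```ruby
    # Hypothetical helper: delay in seconds for the job at position `index`,
    # so that at most `max_per_second` jobs become due in each one-second slot.
    def stagger_delay(index, max_per_second)
      raise ArgumentError, "max_per_second must be positive" unless max_per_second.positive?

      index / max_per_second # integer division groups jobs into slots
    end

    delays = (0...10).map { |index| stagger_delay(index, 4) }
    puts delays.inspect

    # With Sidekiq loaded, the delay could be passed to perform_in, e.g.:
    #   EntityXJob.perform_in(stagger_delay(index, 4), entity_x_id)
    ```

    If the API enforces a hard limit, a shared limiter (for example one backed by Redis) is probably the safer choice, since all Sidekiq processes would need to respect the same budget.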


    Concurrency vs. parallelism

    There is a difference between Sidekiq processes (which I called workers above) and Sidekiq threads. In short and simplified:

    Because of the GIL (global interpreter lock, called the GVL in CRuby), one Ruby process can only run one thread at a time on one CPU core. When a thread is waiting, for example for IO, another thread can run while the first one is waiting or paused. This is concurrency. But multiple Sidekiq processes can run on different CPU cores, which allows real parallelism.

    The general rule of thumb to achieve maximum performance with Sidekiq is: Start 1 Sidekiq process per available CPU core to maximize parallelism. Then fine-tune the number of threads per process according to your jobs' workload pattern and available memory to maximize concurrency in each process.
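
    The concurrency part can be seen in plain Ruby without Sidekiq. In this small sketch, `simulated_api_call` is a made-up stand-in that just sleeps for 0.2 seconds to imitate waiting on a network response. While a thread sleeps it does not hold the interpreter lock, so the eight calls overlap.

    ```ruby
    require "benchmark"

    # Made-up stand-in for an API request: the sleep imitates network wait.
    def simulated_api_call(id)
      sleep(0.2)
      id * 2
    end

    results = nil
    elapsed = Benchmark.realtime do
      # Start eight threads; each one spends its time waiting, not computing.
      threads = (1..8).map { |id| Thread.new { simulated_api_call(id) } }
      # Thread#value joins the thread and returns the block's result.
      results = threads.map(&:value)
    end

    puts "results: #{results.inspect}"
    puts format("elapsed: %.2f s (sequential would be about 1.6 s)", elapsed)
    ```

    If `simulated_api_call` did heavy computation instead of waiting, the threads would take turns and the total time would be close to the sequential one. That is the case where additional processes help.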