ruby-on-rails ruby concurrency celluloid

Dynamically assigning actors in Celluloid


I'm learning how to use Celluloid. I’ve read all the documentation and think I understand how to use it, but I lack practice. I'm about to test it on a CSV file with almost 12,000 rows.

I’m unsure how many actors I should assign to a job. I'm guessing this number should be dynamic. According to this RailsCasts episode, the default pool size is the number of cores in your machine, but surely you should change this number based on your workload?

I have 12,000 records to get through. If I execute the code below, I'm guessing it will start all the actors in my pool and queue them up to handle the jobs. But how should I gauge how many actors to dynamically assign to the work?

There are still many holes in my understanding, so feel free to challenge my whole implementation.

class Model < ActiveRecord::Base
  include Celluloid
  def initialize(row)
    self.name = row[0]
    self.alt_id = row[1]
    self.definition = row[2]
    self.save
    self.terminate
  end    
end

CSV.open("./files/my_file.csv", "wb") do |csv|
  Model.supervise(csv)
end

Solution

  • First, in your case you should create a different class for your actor.

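    class Model < ActiveRecord::Base
      # Builds and saves one record from a CSV row (an array of column values).
      def self.persist_from_csv(row)
        new.tap do |m|
          m.name = row[0]
          m.alt_id = row[1]
          m.definition = row[2]
          m.save
        end
      end
    end
    
    # The actor lives in its own class, so Model stays a plain ActiveRecord class.
    class CSVWorker
      include Celluloid
    
      def persist_from_csv(row)
        Model.persist_from_csv(row)
      end
    end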
    

    Then you can create a pool and do the work for each row.

    require "csv"

    pool = CSVWorker.pool(size: 4)
    CSV.foreach("./files/my_file.csv") do |row|
      pool.async.persist_from_csv(row)
    end
    

    Notice the async. It hands each row to the pool without waiting for the result, which is what makes the rows run in pseudo-parallel.

    I admit I haven't tested this, but even if it Works™, you should benchmark it to see whether there's actually any gain from parallelisation. I doubt it will be much faster in MRI, because MRI threads only overlap while waiting on IO and the only IO involved here is DB queries.
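
To choose a pool size, it may help to time the same workload at a few sizes. The following is only a rough sketch under stated assumptions: it uses plain Ruby threads and a queue as a stand-in for a Celluloid pool, and `fake_persist` and `process_with_workers` are hypothetical helpers in which a short sleep simulates the wait on the database. Swapping in the real worker and real rows would be needed before drawing any conclusions.

```ruby
require "benchmark"

# Hypothetical stand-in for the per-row work: the sleep simulates waiting on the DB.
def fake_persist(row)
  sleep 0.002
  row
end

# Processes all rows with `size` plain threads pulling from a shared queue.
# This is NOT Celluloid; it only mimics a fixed-size pool for timing purposes.
def process_with_workers(rows, size)
  queue = Queue.new
  rows.each { |row| queue << row }
  size.times { queue << :done } # one stop marker per worker

  processed = Queue.new
  workers = Array.new(size) do
    Thread.new do
      while (row = queue.pop) != :done
        processed << fake_persist(row)
      end
    end
  end
  workers.each(&:join)
  processed.size
end

# Made-up sample rows in the same shape as the CSV rows (name, alt_id, definition).
rows = Array.new(200) { |i| ["name #{i}", i, "definition #{i}"] }

# Time the same workload at several candidate pool sizes.
timings = [1, 2, 4, 8].map do |size|
  seconds = Benchmark.realtime { process_with_workers(rows, size) }
  [size, seconds]
end

timings.each { |size, seconds| puts format("%2d workers: %.3fs", size, seconds) }
```

Because the simulated work is pure waiting, this sketch will flatter larger pools. With real DB writes the gain is likely to level off sooner, for example once the database connection pool is saturated, which is why measuring the actual job matters.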