ruby-on-railsmongodbmongoidsidekiqmoped

Mongoid/Moped errors when processing jobs with Sidekiq


I am seeing a ton of jobs failing when running Sidekiq that are all related to connectivity issues with my Mongo database. I'm stress testing the machine with lots of load, so I queued over 18,000 jobs each with a 5 second retry upon failure. Some of the jobs (I'm guessing the ones with a worker that was able to successfully retrieve a connection thread) worked out just fine. Then I have a ton of others with errors like this:

2013-12-26T19:25:35Z 13447 TID-m2biw WARN: {"retry"=>true, "queue"=>"default",
"class"=>"ParseFileWorker", "args"=>[{"filename"=>"/opt/media/AE-67452_36.XML"}], 
"jid"=>"5d6c48fa94ed8c8da1b226fc", "enqueued_at"=>1388084903.6113343, 
"error_message"=>"Could not connect to a primary node for replica set #
<Moped::Cluster:44552140 @seeds=[<Moped::Node resolved_address=\"127.0.0.1:27017\">]>", 
"error_class"=>"Moped::Errors::ConnectionFailure", "failed_at"=>"2013-12-26T19:16:25Z", 
"retry_count"=>3, "retried_at"=>2013-12-26 19:25:35 UTC}

There's also timeout errors such as the following:

Timeout::Error: Waited 0.5 sec

Note, I'm running Rails 4 with Mongoid code checked out from the master branch in https://github.com/mongoid/mongoid. From what I read, previous versions of Mongoid required explicit closing of connections when a Sidekiq job was done processing. Mongoid 4 is supposed to do that automatically. I can't confirm if it's doing it or not. It seems the problem is two fold when jobs get queued too fast connections aren't available or are timing out. Some workers are successful in opening a connection. Some jobs have to wait until a retry to parse.

I'm not entirely sure what the best solution here is. If I queue jobs slowly, everything works fine. Unfortunately that's not how the app will function in the real world.

How do I ensure my sidekiq jobs get a Moped/Mongoid connection they need to get the job done, or wait until one is available to process? Seeing a ton of exceptions/errors because of this issue. Below is a stack trace of the common, recurring error:

2013-12-26T20:56:58Z 1317 TID-i70ms ParseFileWorker JID-0fc375d8fd9707e7d3ec3238 INFO: fail: 1.507 sec
2013-12-26T20:56:58Z 1317 TID-i70ms WARN: {"retry"=>true, "queue"=>"default", "class"=>"ParseFileWorker", "args"=>[{"filename"=>"/opt/media/AC-19269_287.XML"}], "jid"=>"0fc375d8fd9707e7d3ec3238", "enqueued_at"=>1388091410.0421875, "error_message"=>"Waited 0.5 sec", "error_class"=>"Timeout::Error", "failed_at"=>2013-12-26 20:56:58 UTC, "retry_count"=>0}
2013-12-26T20:56:58Z 1317 TID-i70ms WARN: Waited 0.5 sec
2013-12-26T20:56:58Z 1317 TID-i70ms WARN: /media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:35:in `block (2 levels) in pop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:31:in `loop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:31:in `block in pop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:30:in `synchronize'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:30:in `pop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool.rb:66:in `checkout'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool.rb:53:in `with'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:114:in `connection'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:141:in `disconnect'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:156:in `down!'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:442:in `rescue in refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:431:in `refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:182:in `block in refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:194:in `each'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:194:in `refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:151:in `nodes'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:240:in `with_primary'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/primary.rb:55:in `block in with_node'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/selectable.rb:65:in `call'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/selectable.rb:65:in `with_retry'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/primary.rb:54:in `with_node'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/query.rb:127:in `first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:201:in `block (2 levels) in first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:537:in `with_sorting'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:200:in `block in first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:449:in `try_cache'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:199:in `first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual.rb:19:in `first'
/media/apps/tsn-parser-server/app/models/parser/core.rb:88:in `prepare_parse'

** UPDATE **

If I lower the number of sidekiq workers to 10 or so, the issue stops from happening -- simply because there's less workers opening a connection to MongoDB with only 10 threads running. There has to be a way to use a connection pool for Mongoid/Moped in sidekiq. Since Rails is in charge of setting up and tearing down the connection, I'm not sure how to make this happen.

** UPDATE 2 **

When I point this to our production server, it becomes 10x worse. Almost every single thread is a Timeout::Error. Some get through, but not many at all. This is pretty awful. The problem can at least be "worked around" by limiting sidekiq to 10 workers when MongoDB is local. But this is not a realistic use case and I need to be able to connect to a database on a separate machine.


Solution

  • First thing to check would be mongod connection limit and OS limits on the DB side, since they were fine it seemed likely that connection pool size may have been the limiting factor.

    It turns out that mongoid connection pool size default was 5 which is quite low for your large number of workers. Raising those as shown in an example here will allow more workers to get a connection simultaneously.