Mongoid/Moped errors when processing jobs with Sidekiq

I am seeing a ton of jobs failing when running Sidekiq, all related to connectivity issues with my Mongo database. I'm stress testing the machine under heavy load, so I queued over 18,000 jobs, each with a 5 second retry upon failure. Some of the jobs (I'm guessing the ones whose worker was able to successfully get a connection from the pool) worked out just fine. Then I have a ton of others with errors like this:

2013-12-26T19:25:35Z 13447 TID-m2biw WARN: {"retry"=>true, "queue"=>"default",
"class"=>"ParseFileWorker", "args"=>[{"filename"=>"/opt/media/AE-67452_36.XML"}], 
"jid"=>"5d6c48fa94ed8c8da1b226fc", "enqueued_at"=>1388084903.6113343, 
"error_message"=>"Could not connect to a primary node for replica set #
<Moped::Cluster:44552140 @seeds=[<Moped::Node resolved_address=\"127.0.0.1:27017\">]>", 
"error_class"=>"Moped::Errors::ConnectionFailure", "failed_at"=>"2013-12-26T19:16:25Z", 
"retry_count"=>3, "retried_at"=>2013-12-26 19:25:35 UTC}

There are also timeout errors such as the following:

Timeout::Error: Waited 0.5 sec

Note: I'm running Rails 4 with Mongoid checked out from the master branch of https://github.com/mongoid/mongoid. From what I've read, previous versions of Mongoid required explicitly closing connections when a Sidekiq job finished processing; Mongoid 4 is supposed to do that automatically, though I can't confirm whether it does. The problem seems twofold: when jobs are queued too fast, connections either aren't available or time out. Some workers succeed in opening a connection; other jobs have to wait until a retry to parse.
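
For context, the "explicit closing" I'm referring to from older setups was a Kiqstand-style Sidekiq server middleware that drops the Moped session after every job. This is just my understanding of that pattern, sketched from memory; I'm not running it under Mongoid 4:

# Sketch of the Mongoid 3 era pattern, shown for context only.
class MongoidDisconnectMiddleware
  def call(worker, msg, queue)
    yield
  ensure
    # Drop the Moped session's sockets once the job finishes so the next
    # job starts with a fresh connection.
    ::Mongoid.default_session.disconnect
  end
end

# Hypothetical placement, e.g. config/initializers/sidekiq.rb:
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MongoidDisconnectMiddleware
  end
end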

I'm not entirely sure what the best solution here is. If I queue jobs slowly, everything works fine. Unfortunately that's not how the app will function in the real world.

How do I ensure my Sidekiq jobs either get the Moped/Mongoid connection they need to do their work, or wait until one is available? I'm seeing a ton of exceptions/errors because of this issue. Below is a stack trace of the common, recurring error:

2013-12-26T20:56:58Z 1317 TID-i70ms ParseFileWorker JID-0fc375d8fd9707e7d3ec3238 INFO: fail: 1.507 sec
2013-12-26T20:56:58Z 1317 TID-i70ms WARN: {"retry"=>true, "queue"=>"default", "class"=>"ParseFileWorker", "args"=>[{"filename"=>"/opt/media/AC-19269_287.XML"}], "jid"=>"0fc375d8fd9707e7d3ec3238", "enqueued_at"=>1388091410.0421875, "error_message"=>"Waited 0.5 sec", "error_class"=>"Timeout::Error", "failed_at"=>2013-12-26 20:56:58 UTC, "retry_count"=>0}
2013-12-26T20:56:58Z 1317 TID-i70ms WARN: Waited 0.5 sec
2013-12-26T20:56:58Z 1317 TID-i70ms WARN: /media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:35:in `block (2 levels) in pop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:31:in `loop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:31:in `block in pop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:30:in `synchronize'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool/timed_stack.rb:30:in `pop'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool.rb:66:in `checkout'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/connection_pool-1.2.0/lib/connection_pool.rb:53:in `with'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:114:in `connection'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:141:in `disconnect'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:156:in `down!'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:442:in `rescue in refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/node.rb:431:in `refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:182:in `block in refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:194:in `each'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:194:in `refresh'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:151:in `nodes'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/cluster.rb:240:in `with_primary'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/primary.rb:55:in `block in with_node'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/selectable.rb:65:in `call'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/selectable.rb:65:in `with_retry'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/read_preference/primary.rb:54:in `with_node'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/gems/moped-2.0.0.beta4/lib/moped/query.rb:127:in `first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:201:in `block (2 levels) in first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:537:in `with_sorting'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:200:in `block in first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:449:in `try_cache'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual/mongo.rb:199:in `first'
/media/apps/tsn-parser-server/vendor/bundle/ruby/2.0.0/bundler/gems/mongoid-fc589421bf8b/lib/mongoid/contextual.rb:19:in `first'
/media/apps/tsn-parser-server/app/models/parser/core.rb:88:in `prepare_parse'
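
For what it's worth, the worker itself is nothing exotic. It's roughly the shape below; ParseFileWorker and prepare_parse come from the trace above, while the constructor and the rest of the body are a simplified, hypothetical sketch rather than my actual code:

# Simplified sketch of the worker; the real parsing logic is omitted.
class ParseFileWorker
  include Sidekiq::Worker
  sidekiq_options queue: :default, retry: true

  def perform(args)
    filename = args["filename"]
    # prepare_parse (app/models/parser/core.rb) runs the first Mongoid query;
    # that is where the Timeout::Error surfaces while waiting for a connection.
    Parser::Core.new(filename).prepare_parse
  end
end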

** UPDATE **

If I lower the number of Sidekiq workers to 10 or so, the issue stops happening, simply because fewer workers are opening connections to MongoDB with only 10 threads running. There has to be a way to use a connection pool for Mongoid/Moped in Sidekiq. Since Rails is in charge of setting up and tearing down the connection, I'm not sure how to make this happen.
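
For reference, the only stable configuration I've found so far caps Sidekiq's concurrency, e.g. in config/sidekiq.yml (the values below are just what I happen to be running as a workaround; the key setting is concurrency):

# config/sidekiq.yml -- workaround only: cap the number of worker threads so
# they stop fighting over the handful of available Moped connections.
:concurrency: 10
:queues:
  - default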

** UPDATE 2 **

When I point this at our production server, it becomes 10x worse. Almost every single job fails with a Timeout::Error. Some get through, but not many at all. This is pretty awful. The problem can at least be "worked around" by limiting Sidekiq to 10 workers when MongoDB is local, but that is not a realistic use case and I need to be able to connect to a database on a separate machine.

Isleen answered 26/12, 2013 at 19:39 Comment(12)
Without seeing some diagnostics from the mongod side of things, it's hard to tell whether there is something else you can do to resolve this issue. But to begin with: you are connecting to a replica set, yes? How many connections to the primary exist when the problem first manifests? What does the mongod log say about it? Are you maxing out the number of connections to mongod (20K limit)? Are you reaching the limit on the number of connections the OS will let mongod accept (related to OS settings)? – Justle
This is not a replica set. I have a single MongoDB server that's responsible for all reads/writes currently. I have yet to collect connection information on mongod, but I highly doubt I'm coming anywhere remotely close to the 20K limit. As far as the OS goes, all ulimits seem to be appropriately set. Open files is set to 64,000. – Isleen
Good to know you sanity checked those. If you look in the mongod log, every time a connection opens and closes there should be a pair of "connection accepted from" and "end connection" lines; both include how many total connections there now are. Do those look sane? – Justle
Welp, there's one hint: [initandlisten] connection accepted from 10.141.178.238:52993 #530706 (111 connections now open). 111 seems super excessive. It also doesn't want to go above that many connections. The question I have to figure out now, though, is where the heck those 111 connections are coming from. – Isleen
Start up the mongo shell and run db.currentOp(true), or look at all the "initandlisten" lines in your logs. 111 is not very high; is it simply a lot higher than what you would expect given your application code? – Justle
I figured out why there were 111 connections. It's fine, they're from the app servers. I'm definitely still replicating the issue in my original post even after I've disconnected those connections, so it can be replicated trivially. I'm almost tempted to port my app to Rails 3 and Mongoid 3 vs. whatever is in the master branch and test from there... – Isleen
Is authentication enabled on your mongod server? – Justle
Nope, zero authentication. I just have security groups set up on AWS; that's it though. – Isleen
What's your connection pool size configured to be? – Justle
From what I can tell, neither Mongoid nor Moped has an explicit way to set a connection pool. This seems extremely problematic: I'm not sure of the best way to handle a MongoDB connection pool using this tool with Sidekiq, which runs threaded workers that each require use of said connection pool. – Isleen
You can set the pool size; see comments in this thread for examples: github.com/mongoid/moped/issues/223#issuecomment-26142703 ('pool_size': 35 in that comment). It appears the default pool_size is 5 (github.com/mongoid/moped/blob/…), which would be woefully inadequate for your heavy load, I'd guess. Your comment about each worker getting its own pool is confusing to me; wouldn't they be sharing connections across all threads? – Justle
Looks like increasing the pool_size in mongoid.yml was the key! Want to answer the question with that so I can reward you? Now my workers are taking ~15 seconds a pop to process jobs, and the time to process goes down as I lower the number of Sidekiq workers, but I guess that's another issue. – Isleen

The first thing to check would be the mongod connection limit and OS limits on the DB side. Since those were fine, it seemed likely that the connection pool size was the limiting factor.

It turns out that Mongoid's default connection pool size was 5, which is quite low for your large number of workers. Raising it, as shown in the example linked in the comments above, will allow more workers to get a connection simultaneously.
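
Concretely, that means something along these lines in mongoid.yml. This is a sketch for Mongoid 4 / Moped 2: the host, database name, and sizes are illustrative, and pool_timeout only applies if your Moped version supports it:

production:
  sessions:
    default:
      hosts:
        - your.mongo.host:27017   # illustrative host
      database: your_database     # illustrative database name
      options:
        # Default pool_size is 5; size it to at least the number of Sidekiq
        # threads that may need a connection at the same time.
        pool_size: 35
        # How long a thread waits for a free connection before raising
        # Timeout::Error (the "Waited 0.5 sec" in the question is the default).
        pool_timeout: 5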

Justle answered 29/12, 2013 at 8:17 Comment(0)
