We're nearly done with our upgrade from Rails 4.2.11.1 to Rails 5.2.3. All tests pass, and the app runs without issue locally. On staging (and in our failed attempt at upgrading production) we hit a very strange error: the number of connections to our database grows and grows until the app becomes unresponsive.
The problem shows up only sporadically on staging (it's hard to reproduce), but very consistently in our production environment. I think the main difference is traffic volume: our staging servers are used only for our own testing, whereas production was receiving traffic from several thousand users. I'm attempting some load testing on staging to recreate the issue so that we have a chance of pinpointing the problem, but for now we're at a loss as to why this is happening.
It's as if the code never reuses already-established database connections; instead it opens a new one each time it needs one, uses it once (presumably), and then forgets about it. Has anyone else experienced a problem like this? We would really appreciate any help or insight.
Our specific software versions are listed below:
Gemfile
source 'http://rubygems.org'
ruby '2.6.5'
gem 'nokogiri'
gem 'rails', '5.2.3'
gem 'rake'
gem 'mysql2'
gem 'mini_magick'
gem 'liquid', '~> 3.0.1'
gem 'httparty'
gem 'mime-types', '>=1.16', require: 'mime/types'
gem 'oauth'
gem 'oauth2', '~> 1.4'
gem 'roxml'
gem 'sax-machine'
gem 'googlecharts', '1.6.7', require: 'gchart'
gem 'rubyzip', '~> 1.2.1', require: 'zip'
gem 'ruby-openid'
gem 'RedCloth', '~> 4.2.9'
gem 'scoped_search', '~> 4.1.3'
gem 'wicked_pdf', '~> 1.1.0'
gem 'wkhtmltopdf-binary-edge', '~> 0.12.4.0'
gem 'devise'
gem 'devise-encryptable'
gem 'paperclip'
gem 'aws-sdk'
gem 'whois', '~> 2.0.4'
gem 'validates_timeliness'
gem 'will_paginate'
gem 'escape_utils'
gem 'acts_as_list'
gem 'acts_as_paranoid'
gem 'prawn', '~> 2.2.2'
gem 'prawn-table', '~> 0.2.2'
gem 'net-dns', '~> 0.6.1'
gem 'in_place_editing', '~> 1.1.2'
gem 'aasm', '~> 3.0.26'
gem 'dynamic_form', '~> 1.1.4'
gem 'delayed_job_active_record', '~> 4.1', '>= 4.1.2'
gem 'daemons'
gem 'carmen-rails', github: 'jim/carmen-rails'
gem 'dalli'
gem 'Ascii85', '1.0.1'
gem 'rinku', require: 'rails_rinku'
gem 'jquery-rails'
gem 'sass-rails'
gem 'compass-rails', '3.0.2'
gem 'compass-rgbapng', '0.2.1', require: 'rgbapng'
gem 'fancy-buttons', '~> 1.1.1'
gem 'haml-rails'
gem 'haml'
gem 'slim-rails'
gem 'coffee-rails'
gem 'uglifier', '~> 2.3.1'
gem 'holidays', '~> 1.0.5'
gem 'rack-utf8_sanitizer'
gem 'autoprefixer-rails'
gem 'loofah'
gem 'yajl-ruby', require: 'yajl'
gem 'gretel'
gem 'crack'
gem 'lhm'
gem 'twilio-ruby'
gem 'open_uri_redirections'
gem 'godaddy-api', '~> 0.1.0'
gem 'html_truncator', '~> 0.4.1'
gem 'acme-client'
gem 'net-ssh'
gem 'mustache', "~> 1.0"
gem 'net-sftp', '~> 2.1', '>= 2.1.2'
gem 'activeresource'
gem 'geokit', '~> 1.13.1'
gem 'fomo', '~> 0.1.0'
gem 'nameable', '~> 1.1'
gem 'xmlrpc'
gem 'activemodel-serializers-xml'
gem 'airbrake', '~> 9.4', '>= 9.4.5'
It turns out that this problem had nothing to do with our maximum connection limit, and everything to do with how Rails handles database connections outside the normal ActionController request/response cycle.
We had some logging code that inadvertently loaded an ActiveRecord object on each request, and it ran before the regular request/response lifecycle of a controller. This opened a database connection that was never returned to the pool. Connections kept accumulating until we hit our maximum, at which point the instance would hang waiting for a free connection.
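The failure mode can be sketched with a hypothetical piece of Rack middleware (the names `RequestAuditMiddleware` and `AuditLog` are illustrative, not our actual code). ActiveRecord lazily checks a connection out of the pool the first time a thread touches the database, and if that happens outside the framework's request lifecycle, nothing ever checks it back in:

```ruby
# Leaky version: this runs outside the ActionController lifecycle,
# so no framework hook ever performs ActiveRecord's connection
# cleanup for the work it does.
class RequestAuditMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # This lazily checks a connection out of the pool for the
    # current thread -- and nothing ever checks it back in.
    AuditLog.create!(path: env['PATH_INFO'])
    @app.call(env)
  end
end
```

Each leaked checkout is one more connection held until the pool (and eventually the database's max_connections) is exhausted.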
Moral of the story: make sure any database calls happen within the normal request/response lifecycle, or use
Rails.application.executor.wrap do
  ...
end
to wrap the code that makes the database call, so that the connection is properly checked back into the connection pool.
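For example, if the offending code lives in a piece of Rack middleware, wrapping the database work in the executor lets Rails run its cleanup hooks when the block exits (`RequestAuditMiddleware` and `AuditLog` are hypothetical names standing in for whatever your logging code looks like):

```ruby
# Fixed version: the executor block runs ActiveRecord's completion
# callbacks when it exits, returning the connection to the pool.
class RequestAuditMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    Rails.application.executor.wrap do
      # Connection is checked out of the pool here...
      AuditLog.create!(path: env['PATH_INFO'])
    end
    # ...and checked back in when the block exits.
    @app.call(env)
  end
end
```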