node.jspostgresqlamazon-rdsnode-postgres

How to optimize Postgresql max_connections and node-postgres connection pool?


In brief, I am having trouble supporting more than 5000 read requests per minute from a data API leveraging Postgresql, Node.js, and node-postgres. The bottleneck appears to be in between the API and the DB. Here are the implmentation details.

I'm using an AWS Postgresql RDS database instance (m4.4xlarge - 64 GB mem, 16 vCPUs, 350 GB SSD, no provisioned IOPS) for a Node.js powered data API. By default the RDS's max_connections=5000. The node API is load-balanced across two clusters with 4 processes each (2 Ec2s with 4 vCPUs running the API with PM2 in cluster-mode). I use node-postgres to bind the API to the Postgresql RDS, and am attempting to use it's connection pooling feature. Below is a sample of my connection pool code:

var pool = new Pool({
    user: settings.database.username,
    password: settings.database.password,
    host: settings.database.readServer,
    database: settings.database.database,
    max: 25, 
    idleTimeoutMillis: 1000
});

/* Example of pool usage */
pool.query('SELECT my_column FROM my_table', function(err, result){
    
    /* Callback code here */
});

Using this implementation and testing with a load tester, I can support about 5000 requests over the course of one minute, with an average response time of about 190ms (which is what I expect). As soon as I fire off more than 5000 requests per minute, my response time increases to over 1200ms in the best of cases and in the worst of cases the API begins to frequently timeout. Monitoring indicates that for the EC2s running the Node.js API, CPU utilization remains below 10%. Thus my focus is on the DB and the API's binding to the DB.

I have attempted to increase (and decrease for that matter) the node-postgres "max" connections setting, but there was no change in the API response/timeout behavior. I've also tried provisioned IOPS on the RDS, but no improvement. Also, interestingly, I scaled the RDS up to m4.10xlarge (160 GB mem, 40 vCPUs), and while the RDS CPU utilization dropped greatly, the overall performance of the API worsed considerably (couldn't even support the 5000 requests per minute that I was able to with the smaller RDS).

I'm in unfamilar territory in many respects and am unsure of how to best determine which of these moving parts is bottlenecking API performance when over 5000 requests per minute. As noted I have attempted a variety of adjustments based on the review of Postgresql configuration documentation and node-postgres documentation, but to no avail.

If anyone has advice on how to diagnose or optimize I would greatly appreciate it.

UPDATE

After scaling up to m4.10xlarge, i performed a series of load-tests, varying the number of request/min and the max number of connections in each pool. Here are some screen captures of monitoring metrics:

monitoring metrics

db connections


Solution

  • In order to support more then 5k requests, while maintaining the same response rate, you'll need better hardware...

    The simple math states that: 5000 requests*190ms avg = 950k ms divided into 16 cores ~ 60k ms per core which basically means your system was highly loaded.
    (I'm guessing you had some spare CPU as some time was lost on networking)

    Now, the really interesting part in your question comes from the scale up attempt: m4.10xlarge (160 GB mem, 40 vCPUs).
    The drop in CPU utilization indicates that the scale up freed DB time resources - So you need to push more requests!
    2 suggestions: