I've been given maintenance responsibilities over an API that uses a MariaDB (10.0.21) db.t3.2xlarge RDS instance. Today, for no apparent reason (the number of requests hitting the API shows no anomaly, and there have been no code or database changes), the database became slow to the point where even getting an access token from the API took 4-6 seconds (normally it would be under 500 ms). From Performance Insights (for the last 5 h) I can see that the database load by waits (AAS) is dominated by "synch/mutex/aria/PAGECACHE::cache_lock" (29.92), followed by CPU (13.87), and the SQL generating all of this is a SELECT from a view that worked just fine up to this point. The graphs: https://imgur.com/B5baGzx https://imgur.com/BTU6MAf
All the graphs I could get on the RDS: https://imgur.com/a/8XO04gZ
The ANALYZE output for the SELECT: https://imgur.com/S9uQJpW
I'm not a database expert, but I suspect some of this is not set up properly. The question still remains: why does this happen all of a sudden? We do have automatic minor engine upgrades enabled, but I checked and there were no updates in the past week. I've also checked the database's process list and could see the SELECT from the v_links view stuck in the "Sending data" state.
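For anyone wanting to reproduce this check: a quick sketch of how I inspected the process list (standard MariaDB statements, nothing specific to my setup; the `Sending data` filter is just the state I happened to be looking for):

```sql
-- Show all running statements with their state; a query stuck in
-- "Sending data" shows up here with a steadily growing Time value.
SHOW FULL PROCESSLIST;

-- Same information, but filterable via information_schema:
SELECT id, time, state, info
FROM information_schema.PROCESSLIST
WHERE state = 'Sending data'
ORDER BY time DESC;
```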
I've also run the query that was stuck in "Sending data" directly from the console, and it finished in 10 ms.
I've created a thread on aws forums as well: https://forums.aws.amazon.com/thread.jspa?messageID=921037
After getting in touch with AWS support and a long investigation, we discovered that max_tmp_tables was set to 32. Since we had a large number of concurrent requests on a view that created multiple temporary tables, 32 was too small: temporary tables started to spill to disk, so the Aria page-cache wait time kept increasing until the database effectively hung.
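For reference, this is roughly how you can check the variable and confirm whether internal temporary tables are spilling to disk (standard MariaDB statements; the exact counter values will obviously differ per workload):

```sql
-- Per-session limit on simultaneously open temporary tables
-- (32 is the default in MariaDB 10.0):
SHOW GLOBAL VARIABLES LIKE 'max_tmp_tables';

-- Counters showing how many internal temporary tables were created
-- and how many of those went to disk (Created_tmp_disk_tables):
SHOW GLOBAL STATUS LIKE 'Created_tmp%';
```

Note that on RDS you can't `SET GLOBAL` this yourself; the change has to be made in the instance's DB parameter group.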