Using iostat, I found spikes in disk writes once per minute. I believe these spikes are caused by fsync, since MongoDB by default flushes data to disk every 60 seconds.
I also found that slow queries appear in the slow query log coinciding with these spikes.
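To confirm the periodicity, I checked the gaps between spikes in the iostat samples. A minimal sketch of that check (the sample data and the spike threshold are made up for illustration):

```python
# Illustrative sketch: confirming the ~60 s periodicity of write spikes
# from sampled disk-write throughput (e.g. collected via `iostat -x 1`).
# The fake data and the threshold below are assumptions for illustration.

def spike_intervals(samples, threshold):
    """Return the gaps (in seconds) between samples whose write
    throughput exceeds `threshold`. `samples` is a list of
    (seconds_since_start, write_kb_per_s) tuples."""
    spike_times = [t for t, kbps in samples if kbps > threshold]
    return [b - a for a, b in zip(spike_times, spike_times[1:])]

# Fake one-per-second samples: quiet baseline with a burst every 60 s.
samples = [(t, 50000 if t % 60 == 0 else 2000) for t in range(1, 301)]

print(spike_intervals(samples, threshold=10000))  # [60, 60, 60, 60]
```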
There is an option called storage.syncPeriodSecs, but according to the docs:
Do not set this value on production systems. In almost every situation, you should use the default setting.
Executing fsync more often could reduce the spikes, so I would like to know the risks of changing this value in production.
I would appreciate any thoughts.
Details:
MongoDB version: 3.2.16
Storage engine: WiredTiger
Slow queries during the spike: a couple of them, each around 1 second; they do not stall the server
Deployment: Sharded cluster. Replica sets with two members (primary + secondary)
Specs: CPU 8 cores, Memory 64GB, SSD disk
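For reference, this is where the option would go in mongod.conf (60 is the default; the path is illustrative):

```yaml
# mongod.conf -- the setting in question lives under `storage`.
# 60 is the default; any other value here would be experimental.
storage:
  dbPath: /var/lib/mongodb   # path is illustrative
  engine: wiredTiger
  syncPeriodSecs: 60
```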
The WiredTiger storage engine performs a checkpoint every 60 seconds by default. It sounds like your deployment is struggling during these checkpoint events. The behaviour you're seeing is typical of a heavy write load on hardware that is (or could be) under-provisioned.
It is usually not recommended to change the syncPeriodSecs value in a production environment, since the default is deemed to strike the right balance between memory usage, the number of fsync events, the potential for losing data between fsync events in a crash, and other considerations on a typical hardware configuration.
Changing this value could make the stalls worse. You can, of course, experiment with changing it (lower or higher) to see whether it "smooths out" the fsync events. Having said that, this is an advanced tuning mechanism that is best reserved for when other options have been exhausted. If possible, stalls like these are usually best solved by provisioning better hardware, since the current hardware appears to be struggling under the load you expect it to handle.