Tags: azure, redis, timeout, stackexchange.redis

Random Azure Redis timeouts


I've seen a number of questions on this topic, but I can't seem to find one that addresses the odd issue I'm seeing.

We have a brand-new setup in Azure (two, actually: Sandbox and Production). Each subscription has its own Azure Redis instance. Since these are new instances, we aren't running traffic through them, and yet I'm still getting random alerts from our monitoring when Redis times out on the standard ping health checks that run every few minutes. The application and the Redis instance are in the same region (East US).
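For reference, the health check is essentially just a periodic PING through a shared multiplexer. A minimal sketch of what ours does, assuming StackExchange.Redis (the connection string below is a placeholder, not our real endpoint):

    using System;
    using System.Threading.Tasks;
    using StackExchange.Redis;

    public static class RedisHealthCheck
    {
        // One multiplexer for the lifetime of the app; the connection string
        // here is a placeholder, not our real endpoint.
        private static readonly Lazy<ConnectionMultiplexer> Connection =
            new(() => ConnectionMultiplexer.Connect(
                "instance.redis.cache.windows.net:6380,ssl=true,abortConnect=false"));

        public static async Task<bool> IsHealthyAsync()
        {
            try
            {
                // PingAsync round-trips a PING and returns the measured latency.
                TimeSpan latency = await Connection.Value.GetDatabase().PingAsync();
                return latency < TimeSpan.FromSeconds(1);
            }
            catch (RedisTimeoutException)
            {
                return false; // this is what fires our alert
            }
        }
    }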

Sandbox is using C1 Basic and Production is using C1 Standard. Both are seeing these timeouts.

I read up on options for addressing timeouts, and Microsoft's guidance suggested moving our Redis calls to async. After making this change (and all the related changes up the call chain through the applications), we are still seeing these very random timeouts.

I've included the exception details below. We capture the exception and log it, and we treat it as a cache miss, so the user isn't interrupted beyond the 5+ second delay.
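For concreteness, the read path looks roughly like this. A minimal sketch: CacheReader, GetOrNullAsync, and the logging details are illustrative names, not our actual code:

    using System.Threading.Tasks;
    using Microsoft.Extensions.Logging;
    using StackExchange.Redis;

    public class CacheReader
    {
        private readonly IDatabase _db;
        private readonly ILogger<CacheReader> _logger;

        public CacheReader(IConnectionMultiplexer mux, ILogger<CacheReader> logger)
        {
            _db = mux.GetDatabase();
            _logger = logger;
        }

        public async Task<string?> GetOrNullAsync(string key)
        {
            try
            {
                RedisValue cached = await _db.StringGetAsync(key);
                if (cached.HasValue)
                    return cached;
            }
            catch (RedisTimeoutException ex)
            {
                // Log and fall through: the caller sees a cache miss, not a failure.
                _logger.LogWarning(ex, "Redis GET timed out for {Key}; treating as cache miss", key);
            }
            return null; // caller falls back to the source of truth
        }
    }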

For context, this is not my first time using Azure Redis, but it is certainly the first time I've encountered this strange issue. What are we missing here?

    StackExchange.Redis.RedisTimeoutException Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5203ms elapsed, timeout is 5000ms), command=GET, next: GET key, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 24, async-ops: 272, serverEndpoint: instance.redis.cache.windows.net:6380, conn-sec: 8781.56, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: client(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=1022,Min=2,Max=1023), POOL: (Threads=2,QueuedItems=0,CompletedItems=55237,Timers=5), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Update: I also tried increasing the thread pool minimum to 300, with no improvement. The error message does reflect the larger minimum size.
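The thread-pool change was just a one-time call at application startup, something like this sketch (the original minimums match the Min=2 worker / Min=1 IOCP values in the exception above):

    using System;
    using System.Threading;

    ThreadPool.GetMinThreads(out int worker, out int iocp);
    // Raise the floor for both worker and IOCP threads so bursts don't wait
    // on the pool's gradual thread injection.
    ThreadPool.SetMinThreads(Math.Max(worker, 300), Math.Max(iocp, 300));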


Solution

  • Hey, I am going to help you out here. I just finished a Sev A case with MSFT, and this was their response:

    Starting around the beginning of August 2023, we started seeing high server load on some smaller cache sizes. Investigation revealed the cause to be a change in the behavior of one of the Azure security monitoring services: the service started reading the entire local event log every hour. This log file can reach a substantial size, so reading it in full generates a lot of I/O and can significantly affect CPU usage on smaller caches. The team in charge of this service is still investigating the root cause of the problem. In the meantime, we have identified a mitigation that reduces the growth rate of the log file by approximately 50%. Combined with the normal monthly cache update cadence, which also resets the log file during the update process, this will prevent the problem from recurring for the majority of caches. This mitigation will be released with the upcoming monthly service updates, which should be fully deployed to all regions by mid-October.