appfabric, appfabric-cache

Troubleshooting AppFabric Scaling Issues (Intermittent ErrorCode<ERRCA0017>:SubStatus<ES0006> Errors)


We've implemented AppFabric Windows Server Cache for our web application. Initially, we were able to use the cache without any issues. We then increased traffic roughly 100-fold and began experiencing intermittent exceptions. The exceptions occur roughly once every two days and last about a minute each time.

Our configuration:

The errors, in the order in which they occur (the exceptions occur on each of the nine web servers during the one-minute period):

We have also created a tracelog session on the caching server to capture more information to diagnose the issue - any suggestions on how to analyze this would be appreciated (I can make this available if need be).

We also monitored various AppFabric, CLR, and network performance counters. Below is a screenshot of the event as it occurs:

[Screenshot: AppFabric Perfmon capture]

Thanks in advance for any thoughts or advice you can share on resolving this issue.

UPDATE 1

The following exceptions occur continuously on the AppFabric caching server during the intermittent errors (extracted from the tracelogs):

UPDATE 2

After another day of troubleshooting we took the following actions which produced some improvement:

  1. Based on this and this we increased maxConnectionsToServer to 3. As a result we gained 50% more client requests/sec, as recorded by the AppFabric Caching:Cache perf counter, but the intermittent errors did not stop occurring.

  2. We increased maxBufferSize and maxBufferPoolSize to 2147483647 (Int32.MaxValue) in the cache server configuration. So far we are able to handle 300x the original traffic volume without errors. We will continue to increase traffic volume and monitor. More updates to follow.
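
For reference, the client-side setting from step 1 is an attribute of the dataCacheClient element in the application's configuration file. The fragment below is a minimal sketch, not our actual configuration: the host name CacheServer1 is a placeholder, and 22233 is the default cache port.

```xml
<!-- Fragment of the cache client's app.config / web.config.
     The dataCacheClient section must also be declared under configSections. -->
<dataCacheClient maxConnectionsToServer="3">
  <hosts>
    <!-- Placeholder host name; 22233 is the default AppFabric cache port. -->
    <host name="CacheServer1" cachePort="22233" />
  </hosts>
</dataCacheClient>
```

Each additional connection is opened per DataCacheFactory instance, so the effective connection count also depends on how many factories the application creates.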

UPDATE 3

We added two more hosts with 16GB each to the cluster and enabled high availability mode (via Secondaries=1). Currently the original host remains in the cluster with 96GB of RAM; all hosts have cacheSize = 12GB. On the cache clients we increased maxConnectionsToServer to 12 (1 per core). Below are our findings:

We plan to remove 80GB ram from the original cache host. More updates to follow.

UPDATE 4

The problem seems to have been solved by reducing the amount of RAM in the cache hosts to 16GB. We no longer see the intermittent errors with traffic increased to 400x. Case closed, it seems. Now on to the next issue: High Availability.


Solution

  • Reposting an answer given by Jeff-ITGuy on social.msdn.microsoft.com:

    You appear to be encountering an issue nearly identical to one I'm working with Microsoft at the moment. If it's the same issue, it is probably caused by GC taking a long time and causing delays in the response time for AppFabric. From your perf counters it looked like GC time shot up when you started getting the problem so it probably is the same issue.

    This issue is being investigated actively by Microsoft. In the meantime, in order to alleviate the problem (at least from our findings) you can run more servers with less memory (shrink the size of the memory space that GC is working against) and you can increase the RequestTimeout on your client. By default that is set to 15000 (15 seconds), but we have tried raising it to 30000, which helped eliminate some of the issues. This is NOT a good long-term solution in my opinion; I'm just passing on information. I've seen the issue with servers having only 24GB of memory (with 12GB for cache), and it only really got noticeably better when we tried 8GB servers with 4GB set to cache. That caused GC to do MUCH better.

    Hope that helps, but if this is the issue I think it is there's no solution yet.

    It did help: the intermittent errors stopped after we reduced the cache host RAM to 16GB.
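
The RequestTimeout workaround mentioned in the answer is a client-side setting, expressed in milliseconds. A minimal sketch of how it might look in the client configuration file, assuming a single cache host; the host name CacheServer1 is a placeholder, and 30000 is the value the answer reports trying, not a recommendation.

```xml
<!-- Fragment of the cache client's app.config / web.config.
     requestTimeout is in milliseconds; the default is 15000. -->
<dataCacheClient requestTimeout="30000">
  <hosts>
    <!-- Placeholder host name; 22233 is the default AppFabric cache port. -->
    <host name="CacheServer1" cachePort="22233" />
  </hosts>
</dataCacheClient>
```

A longer timeout only hides GC pauses from the client; as the answer notes, shrinking the memory each cache host manages is what actually reduced them.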