AppFabric doesn’t recover well from restart

Asked 20/9, 2011 at 10:11 Answered 14/11, 2011 at 10:33

Alright, I’ve successfully deployed AppFabric, and everything was working nicely until we started getting an intermittent exception on the website:

ErrorCode<ERRCA0017>:SubStatus<ES0007>:There is a temporary failure. Please retry later. (The request failed because the server is in throttled state.)

At first I suspected the server was running low on memory (hence the throttled state), but I eventually concluded that wasn’t the issue. In the event log, I found that DistributedCacheService.exe crashed every now and then, which led me to a simple way of reproducing the error in my local development environment:

Start the website, add a few things to the cache.
Restart “AppFabric Caching Service”.
... and I start getting the error.

If I do a Get-CacheClusterHealth BEFORE restarting the service, it looks something like this:

NamedCache = MyCacheName
    Healthy              = 100,00
    UnderReconfiguration = 0,00
    NotPrimary           = 0,00
    NoWriteQuorum        = 0,00
    Throttled            = 0,00

After restarting:

Unallocated named cache fractions
---------------------------------

NamedCache = MyCacheName
    Unallocated fraction     = 100,00

While I get that result from Get-CacheClusterHealth, the site fails. From what I can tell, it corrects itself after a while (10+ minutes).

Is there any way to get AppFabric back on its feet faster?

Pennington answered 20/9, 2011 at 10:11 Comment(3)

Could you please publish the complete exception? Details matter here :-) – Geosyncline 28/9, 2011 at 10:22

Did you have a look at msdn.microsoft.com/en-us/library/ff921020.aspx? – Geosyncline 28/9, 2011 at 10:24

MS recommends that you have a separate cluster for AppFabric caching servers: msdn.microsoft.com/en-us/library/gg186017.aspx – Sarsaparilla 19/10, 2011 at 20:27

In short, the answer is no.

The time a cluster takes to restart increases as you add extra nodes, which leads me to believe that a node synchronisation process is what takes the time.

The exception you're seeing is indeed the AppFabric node entering a throttled state. Whether it enters that state depends on how the high and low watermarks are set on the node. I think the default high watermark is 90%; past that point it starts evicting items according to the eviction policy set on the cache. You should generally use LRU (least recently used), but if the cache still cannot run within the limits set, it throttles itself so as not to bring your server down.

Your application would benefit if it could handle such events gracefully. If you have all nodes listed in the cluster config of your app, then your app should move on to the next node on the next attempt to get data. We use a retry loop that looks for the temporary failure and retries up to 3 times. If the error persists after 3 retries, we log it and return null rather than throwing an exception. This lets the application try a different node, or gives the problem node time to recover:

private object WithRetry(Func<object> method)
{
    int tryCount = 0;
    bool done = false;
    object result = null;
    do
    {
        try
        {
            result = method();
            done = true;
        }
        catch (DataCacheException ex)
        {
            if (ex.ErrorCode == DataCacheErrorCode.KeyDoesNotExist)
            {
                // A missing key is not treated as a failure; result stays null.
                done = true;
            }
            else if ((ex.ErrorCode == DataCacheErrorCode.Timeout ||
                      ex.ErrorCode == DataCacheErrorCode.RetryLater ||
                      ex.ErrorCode == DataCacheErrorCode.ConnectionTerminated)
                     && tryCount < MaxTryCount)
            {
                // Temporary failure: log it and go round the loop again.
                tryCount++;
                LogRetryException(ex, tryCount);
            }
            else
            {
                // Any other error, or retries exhausted: log and return null.
                LogException(ex);
                done = true;
            }
        }
    }
    while (!done);

    return result;
}

And that allows us to do the following:

private void AF_Put(string key, object value)
{
    WithRetry(() => defaultCache.Put(key, value));
}

or:

private object AF_Get(string key)
{
    return WithRetry(() => defaultCache.Get(key));            
}
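
Note that the WithRetry loop above retries immediately. If the goal is to give a struggling node time to recover, one option is to pause between attempts. The following is only a minimal sketch of that idea, not the answerer's code: RetryHelper, TransientCacheException, maxTries and initialDelay are all hypothetical names, and TransientCacheException is a stand-in for the DataCacheException error-code checks shown earlier so that the example runs without the AppFabric assemblies.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a DataCacheException carrying a temporary error
// code (Timeout, RetryLater, ConnectionTerminated). Not part of AppFabric.
public class TransientCacheException : Exception { }

public static class RetryHelper
{
    // Calls method up to maxTries times. After each temporary failure it
    // sleeps, doubling the delay every time. If every attempt fails it
    // returns default(T), mirroring the "return null" behaviour above.
    // Other exception types are not caught here and reach the caller.
    public static T WithRetry<T>(Func<T> method, int maxTries, TimeSpan initialDelay)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; attempt <= maxTries; attempt++)
        {
            try
            {
                return method();
            }
            catch (TransientCacheException)
            {
                if (attempt == maxTries)
                {
                    break; // out of attempts; fall through to default(T)
                }
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
        return default(T);
    }
}
```

In real use the catch clause would be replaced by the DataCacheException checks from WithRetry above, and the starting delay would need tuning; anything long enough to matter to the node also blocks the calling request for that long.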

Speciosity answered 14/11, 2011 at 10:33 Comment(1)

Thanks. I've implemented something similar, where our site falls back to the ASP.NET cache if AppFabric is not responding. However, I found that AppFabric takes upwards of 10 seconds to figure out it's down, so immediately after the first failure I set a 10-minute timeout, during which no requests are sent to AppFabric. Clumsy, but it works. – Pennington 14/11, 2011 at 13:51
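
The fallback described in the comment above can be sketched roughly as follows. This is an illustration under assumptions, not Pennington's actual code: FallbackCache and its members are hypothetical names, the two delegates stand in for the AppFabric lookup and the ASP.NET cache lookup, and the clock is injected only so the cool-down can be exercised without waiting ten minutes.

```csharp
using System;

// Hypothetical helper, not part of AppFabric: after one failure of the
// primary cache it skips the primary for a cool-down period and serves
// every request from the fallback instead.
public class FallbackCache
{
    private readonly Func<string, object> primaryGet;   // e.g. the AppFabric lookup
    private readonly Func<string, object> fallbackGet;  // e.g. the ASP.NET cache lookup
    private readonly TimeSpan coolDown;
    private readonly Func<DateTime> clock;
    private readonly object sync = new object();
    private DateTime skipPrimaryUntil = DateTime.MinValue;

    public FallbackCache(Func<string, object> primaryGet,
                         Func<string, object> fallbackGet,
                         TimeSpan coolDown,
                         Func<DateTime> clock)
    {
        this.primaryGet = primaryGet;
        this.fallbackGet = fallbackGet;
        this.coolDown = coolDown;
        this.clock = clock;
    }

    public object Get(string key)
    {
        bool primaryAllowed;
        lock (sync)
        {
            primaryAllowed = clock() >= skipPrimaryUntil;
        }

        if (primaryAllowed)
        {
            try
            {
                return primaryGet(key);
            }
            catch (Exception)
            {
                // Assumption: any failure trips the cool-down. Real code
                // would more likely catch only DataCacheException.
                lock (sync)
                {
                    skipPrimaryUntil = clock() + coolDown;
                }
            }
        }

        return fallbackGet(key);
    }
}
```

With a cool-down of TimeSpan.FromMinutes(10) and () => DateTime.UtcNow as the clock, only the first request after an outage pays the slow failure; later requests go straight to the fallback until the cool-down expires.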


The same (or a similar) issue happened on one of the projects I worked on. After two weeks of scratching our heads and unsuccessfully trying everything to get our WCF services running (on Azure), we ended up calling Microsoft.

The tech guys from Microsoft supplied us with a (Power)Shell script that is run from the site's runtime and does health checks and maintenance on AppFabric. The script had stuff I hadn't seen in any Azure book, but it did get the job done properly!

Thanks

Handcar answered 10/11, 2011 at 15:42 Comment(2)

But you don't include the script - so what is the answer contained in this "answer"? – Leucas 10/11, 2011 at 20:26

Please supply the script, as I am having a similar issue. – Orts 2/3, 2012 at 17:7
