Creating a High Availability AppFabric Cache Cluster

Asked 11/9, 2012 at 20:12 Answered 20/9, 2012 at 17:6

Is there anything aside from setting Secondaries=1 in the cluster configuration to enable HighAvailability, specifically on the cache client configuration?

Our configuration:

Cache Cluster (3 windows enterprise hosts using a SQL configuration provider):
Cache Clients

With the above configuration, we see primary and secondary regions created on the three hosts. However, when one of the hosts is stopped, the following exceptions occur:

ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.
An existing connection was forcibly closed by the remote host
No connection could be made because the target machine actively refused it 192.22.0.34:22233
An existing connection was forcibly closed by the remote host

Isn't the point of High Availability to be able to handle hosts going down without interrupting service? We are using a named region - does this break High Availability? I read somewhere that named regions can only exist on one host (I did verify that a secondary does exist on another host). I feel like we're missing something in the cache client configuration that would enable High Availability. Any insight on the matter would be greatly appreciated.

Flanagan asked 11/9, 2012 at 20:12 Comment(0)

After opening a ticket with Microsoft we narrowed it down to having a static DataCacheFactory object.

public class AppFabricCacheProvider : ICacheProvider
{
    private static readonly object Locker = new object();
    private static AppFabricCacheProvider _instance;
    private static DataCache _cache;

    private AppFabricCacheProvider()
    {
    }

    public static AppFabricCacheProvider GetInstance()
    {
        lock (Locker)
        {
            if (_instance == null)
            {
                _instance = new AppFabricCacheProvider();
                var factory = new DataCacheFactory();
                _cache = factory.GetCache("AdMatter");
            }
        }
        return _instance;
    }
    ...
}

Looking at the tracelogs from AppFabric, the clients are still trying to connect to all the hosts without handling hosts going down. Resetting IIS on the clients forces a new DataCacheFactory to be created (in our App_Start) and stops the exceptions.

The MS engineers agreed that this approach (a single static DataCacheFactory) is the recommended best practice (we also found several articles about this: see link and link).

They are continuing to investigate a solution for us. In the meantime we have come up with the following temporary workaround, where we force a new DataCacheFactory object to be created whenever we encounter one of the above exceptions.

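using System;
using System.Net.Sockets;
using Microsoft.ApplicationServer.Caching;

public class AppFabricCacheProvider : ICacheProvider
{
    // Negative on purpose: added to "now" in ForceRefresh, so the factory
    // is rebuilt at most once every 5 minutes.
    private const int RefreshWindowMinutes = -5;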

    private static readonly object Locker = new object();
    private static AppFabricCacheProvider _instance;
    private static DataCache _cache;
    private static DateTime _lastRefreshDate;

    private AppFabricCacheProvider()
    {
    }

    public static AppFabricCacheProvider GetInstance()
    {
        lock (Locker)
        {
            if (_instance == null)
            {
                _instance = new AppFabricCacheProvider();
                var factory = new DataCacheFactory();
                _cache = factory.GetCache("AdMatter");
                _lastRefreshDate = DateTime.UtcNow;
            }
        }
        return _instance;
    }

    private static void ForceRefresh()
    {
        lock (Locker)
        {
            if (_instance != null && DateTime.UtcNow.AddMinutes(RefreshWindowMinutes) > _lastRefreshDate)
            {
                var factory = new DataCacheFactory();
                _cache = factory.GetCache("AdMatter");
                _lastRefreshDate = DateTime.UtcNow;
            }
        }
    }

    ...

    public T Put<T>(string key, T value)
    {
        try
        {
            _cache.Put(key, value);
        }
        catch (SocketException)
        {
            ForceRefresh();
            _cache.Put(key, value);
        }
        return value;
    }
}

Will update this thread when we learn more.

Palimpsest answered 20/9, 2012 at 17:6 Comment(2)

Since it is quite some time now, did you get a fix for this from Microsoft? – Renault 26/4, 2013 at 9:42

Actually, due to business time constraints, we ended up switching to using Couchbase for our caching needs (it also fit our requirements better) so we never followed up with Microsoft for a fix. – Palimpsest 27/4, 2013 at 4:6

High Availability is about protecting the data, not making it available every second (hence the retry exceptions). When a cache host goes down, you get an exception and are supposed to retry. During that time, access to HA caches may throw a retry exception back to you while the cluster is busy shuffling data around and creating an extra copy. Regions complicate this further, since they cause a larger chunk to be copied before the cache is HA again.

Also, the client keeps a connection to all cache hosts, so when one goes down the client throws an exception to signal that something happened.

Basically, when one host goes down, AppFabric freaks out until two copies of all data exist again in the HA caches. We created a small layer in front of it to handle this logic, then dropped the servers one at a time to confirm it handled every scenario, so our app kept working, just a tad slower.
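
A minimal sketch of what such a "small layer" might look like, not the layer described in this answer. It is a generic retry helper that depends only on System, so the names CacheRetry, Execute, isTransient and onRetry are hypothetical, and the decision about which exceptions count as retryable is left to the caller.

```csharp
using System;
using System.Threading;

// Hypothetical helper: wraps any cache call and retries it on transient failures.
public static class CacheRetry
{
    // Runs `operation`; if it throws and `isTransient` returns true, invokes
    // `onRetry` (e.g. to rebuild the factory), optionally sleeps, and tries again.
    // The last failure is rethrown once `maxAttempts` is reached.
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? delay = null,
        Action onRetry = null)
    {
        if (operation == null) throw new ArgumentNullException(nameof(operation));
        if (isTransient == null) throw new ArgumentNullException(nameof(isTransient));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            // The filter lets non-transient errors and the final attempt propagate.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                onRetry?.Invoke();
                if (delay.HasValue) Thread.Sleep(delay.Value);
            }
        }
    }
}
```

With AppFabric, the isTransient predicate would presumably check for SocketException or for a DataCacheException whose ErrorCode is DataCacheErrorCode.RetryLater or DataCacheErrorCode.Timeout, and onRetry could point at something like the ForceRefresh method from the other answer. Keeping the predicate outside the helper means the retry policy can be tested without a cache cluster.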

Goddord answered 13/9, 2012 at 1:12 Comment(5)

Hey @Josh. We (@Flanagan and I) understand your point about the retry exceptions. In this particular scenario that exception isn't happening. We also expect a few errors when a cache host is down (as you pointed out there are open connections that will time out). However, what we are seeing is that the "No connection could be made" exceptions persist even 20-30 minutes after a cache host is taken down which was surprising to us. We expected that the client would handle this scenario. This happens even when the host that was taken down is not in the clients' dataCacheClient's hosts section. – Palimpsest 13/9, 2012 at 1:58

When you connect to a host, the client pulls a list of other hosts and connects to them. That is why you experience the issue you see at the end. Are you creating a DataCache object for every request or storing it as a singleton in a static variable? – Goddord 13/9, 2012 at 13:2

It is being stored in a static variable using a singleton. That's a good point - what would you recommend? – Palimpsest 13/9, 2012 at 19:54

One point though - even when a host is not in the web.config's dataCacheClient section, taking it down causes the "No connection could be made" exception – Palimpsest 13/9, 2012 at 20:23

Listing just a single host in the config still retrieves all hosts from the cluster. So I would try changing TransportProperties.ReceiveTimeout and RequestTimeout on the DataCacheFactoryConfiguration (or in web.config) and see whether you get the exception sooner and can recover more quickly. We put a 30 second sleep on our cache if it fails with a bad error (like a connection failure); for retry exceptions we try twice and then fail, but don't sleep. – Goddord 20/9, 2012 at 14:34
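
For reference, the timeouts mentioned in this comment can be set in code roughly as follows. This is a sketch under assumptions, not a tested configuration: it assumes the AppFabric client assemblies are referenced, the host name "cachehost1" is a placeholder, the class name CacheFactoryBuilder is hypothetical, and the timeout values are illustrative rather than recommended. Only port 22233 comes from the error message in the question.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ApplicationServer.Caching;

// Hypothetical helper that builds a factory with shorter timeouts,
// so failures against a stopped host surface sooner.
public static class CacheFactoryBuilder
{
    public static DataCacheFactory Create()
    {
        var config = new DataCacheFactoryConfiguration
        {
            // One listed host is enough for the client to discover the rest of the cluster.
            Servers = new List<DataCacheServerEndpoint>
            {
                new DataCacheServerEndpoint("cachehost1", 22233) // placeholder host name
            },
            // How long a single cache request may take before timing out (illustrative value).
            RequestTimeout = TimeSpan.FromSeconds(5),
            TransportProperties = new DataCacheTransportProperties
            {
                // How long an idle connection waits before being considered dead (illustrative value).
                ReceiveTimeout = TimeSpan.FromSeconds(30)
            }
        };

        return new DataCacheFactory(config);
    }
}
```

A factory built this way would replace the parameterless `new DataCacheFactory()` calls in the provider class, which read their settings from the dataCacheClient section of web.config instead.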
