AppFabric Cache seems unstable

We're trying to use AppFabric distributed cache. After a lot of back and forth with non-domain servers we finally put them in a domain, which made installation and setup a bit easier. We got it up and running after fighting through a ton of errors, most of which look trivial for AppFabric to test for or to report with a more descriptive error message. "Temporary error" does not explain a lot...

But there are still issues.

We set up 3 servers, one of which is the "lead". We finally got the cache working and confirmed it by pointing a Network Load Balancer at one server at a time, verifying that we can set a cache item on one server and retrieve it from another.

Then I restarted the AppFabric Caching service on all servers, and suddenly the cache stopped working. Get-CacheHost says the hosts are up, but we get exceptions like:

ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out
ErrorCode<ERRCA0017>:SubStatus<ES0001>:There is a temporary failure. Please retry later.

Why would this error condition occur by simply restarting the services?
Is AppFabric Cache really ready for production use?
What happens if a server goes offline? Long timeouts?
Are we dependent on the "lead" server being up?

I suspect it will be back up after 5-10 minutes of R&R. It seems to come back by itself sometimes.

Update: It did come up after a few minutes. We have now tested by removing one server from the cluster and it resulted in a long timeout and finally an exception.
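Since ERRCA0017 literally says "Please retry later", one client-side workaround is to retry transient failures with an increasing delay instead of failing on the first exception. The following is only a language-neutral sketch in Python, not AppFabric client code: `call_with_retry`, `is_transient`, and the idea of matching the error code in the exception text are all assumptions made for illustration, and the delays shown are arbitrary.

```python
import time

# Error codes from the messages quoted above. Treating exactly these two
# as "transient" is an assumption for this sketch.
TRANSIENT_CODES = ("ERRCA0017", "ERRCA0018")


def is_transient(exc):
    """Return True if the exception text mentions a known transient code."""
    return any(code in str(exc) for code in TRANSIENT_CODES)


def call_with_retry(operation, attempts=5, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff.

    The delay doubles after each failed attempt, capped at max_delay.
    Non-transient errors, and the final failed attempt, are re-raised.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

A cache read would then be wrapped as `call_with_retry(lambda: cache_get("key"))`, where `cache_get` stands for whatever hypothetical client call is in use. This only hides short outages; it does nothing about the minutes-long recovery after a full restart.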

Dockhand answered 20/1, 2011 at 11:32 Comment(1)
WHY on earth does it take so long to come back up? On a single server. Whatever the technical reason is, it sure makes me skeptical about trusting the whole platform. - Stinkwood

We have been debugging this for some time and I'm sharing what we have found so far.

  • UAC on Windows 2008 blocks access to the local computer, so commands that target the local computer will fail. Start PowerShell as administrator, or turn off UAC completely, to get around this.
  • Simply editing the config file by hand will not work. You need to use the export and import commands (Export-CacheClusterConfig and Import-CacheClusterConfig).
  • Firewalls are a major issue: the installer opens the 222* range of ports, but the PowerShell tools use other Windows services. Turning off the firewall on all servers (not recommended) solved the problem.
  • If a server is removed from the cluster, there is an initial timeout before the cluster can operate again.
  • After a restart the cluster takes 2-5 minutes to come back up.
  • If one server is unreachable during a restart, startup takes longer.
  • If the server hosting the file share for the config is not reachable, the services will not start. We tried to solve this by giving each server a private share.
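To narrow down the firewall issue without switching the firewall off everywhere, a quick TCP reachability check between the cache hosts can help. This is a minimal sketch, not a full diagnosis: the port numbers below are what I believe the AppFabric defaults in the 222* range to be (an assumption, so check them against your own cluster configuration), the function names are made up for this example, and it does not cover the other Windows services that the PowerShell tools rely on.

```python
import socket

# Assumed default AppFabric Caching ports in the 222* range.
# Verify against your own cluster configuration before trusting these.
ASSUMED_PORTS = {
    "cache": 22233,
    "cluster": 22234,
    "arbitration": 22235,
    "replication": 22236,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_hosts(hosts, ports=ASSUMED_PORTS, timeout=2.0):
    """Return {host: {port_name: reachable}} for every host and port."""
    return {
        host: {name: port_is_open(host, port, timeout)
               for name, port in ports.items()}
        for host in hosts
    }
```

Running `check_hosts(["cache1", "cache2", "cache3"])` from each server in turn (the host names are placeholders) shows which directions are blocked. A port that is open from one host but closed from another points at a firewall rule rather than at the caching service itself.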
Dockhand answered 27/1, 2011 at 9:14 Comment(3)
If I understand correctly, using a SQL configuration provider would cause the cluster management to be done by the SQL Server and not by a 'lead host', and so it might reduce the number of issues you encounter? [ msdn.microsoft.com/en-us/library/ee790934.aspx#sectionSection1 ]. IIRC, this should allow you to contact any one cache host to access the cache cluster. - Graber
Did you ever reach any conclusions on this? I am facing the same issues. - Litigable
@Tedd Hansen were you able to get this working? "If the server holding the shared fileshare for config is not reachable the services will not start. We tried to solve this by giving each server a private share." The standard procedure is to have a common file share. Did you have to resort to some kind of hack? Please share your experience. - Africa
