Amazon Elasticache Failover

We have been using AWS ElastiCache for about six months without any issues. Every night a Java app flushes DB 0 of our Redis cache and then repopulates it with updated data. However, on three occasions between July 31 and August 5 the DB was flushed successfully and we were then unable to write the new data to it.

We were getting the following exception in our application:

redis.clients.jedis.exceptions.JedisDataException: redis.clients.jedis.exceptions.JedisDataException: READONLY You can't write against a read only slave.

When we look at the cache events in ElastiCache we can see:

Failover from master node prod-redis-001 to replica node prod-redis-002 completed

We have not been able to diagnose the issue. Since the app had been running fine for the past six months, I am wondering if it is related to a recent ElastiCache release from the 30th of June: https://aws.amazon.com/releasenotes/Amazon-ElastiCache

We have always been writing to our master node and we only have 1 replica node.

If someone could offer any insight it would be much appreciated.

EDIT: This seems to be an intermittent problem. Some days it fails, other days it runs fine.

Isadora answered 5/8, 2015 at 23:13 Comment(5)
Are you using an IP or a DNS name when connecting to your ElastiCache Redis? Normally, if you are using DNS names, you shouldn't have such a problem, because the master DNS name should remain the same and only the IP behind it will change (at least in theory). Also, this is very specific to AWS internals, so try posting the question on their forums as well. - Rangel
Thanks, I am using the DNS name. I haven't had this issue for the past 2 days, so maybe it was an issue on the AWS side that has now been fixed. I will try to get an answer from AWS support. - Isadora
An extremely similar thing happened to my app tonight. The first thing AWS support asked was "Are you using Jedis?" We aren't. They couldn't tell us much. We wound up building new nodes and cutting over. I'll update if they tell us more. - Frans
Thanks for that. We have been in contact with AWS support for the past few weeks. I'll post an answer now with what we have found. - Isadora
By the way, they were asking us about Jedis also. As it turns out, if a failover happens the slave node is promoted to master. Jedis will not reconnect, so a client that was connected to the old master stays connected to that node, which is now a read-only slave. It doesn't have any effect on the cause of the failover. - Isadora
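The reconnect behaviour described in the last comment can be worked around on the client side. Below is a minimal sketch of the idea, not Jedis's own API: `ReadOnlyRetry`, its `Connection` interface and `setWithReconnect` are hypothetical names, and the connection factory is assumed to open a new connection against the primary endpoint's DNS name. It treats any error whose message contains `READONLY` (as in the exception above) as a sign that the connection points at a demoted node.

```java
import java.util.function.Supplier;

/** Hypothetical helper: retry a write on a fresh connection after a READONLY error. */
class ReadOnlyRetry {

    /** Minimal stand-in for a Redis client connection (an assumption, not a Jedis type). */
    interface Connection {
        void set(String key, String value);
        void close();
    }

    /** Returns true if the exception or any of its causes carries a READONLY reply. */
    static boolean isReadOnlyError(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.contains("READONLY")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Attempts the write; on a READONLY error it closes the stale connection,
     * asks the factory for a new one, and retries, up to maxAttempts in total.
     * Returns the connection that accepted the write so the caller can keep using it.
     */
    static Connection setWithReconnect(Connection current, Supplier<Connection> factory,
                                       String key, String value, int maxAttempts) {
        Connection conn = current;
        for (int attempt = 1; ; attempt++) {
            try {
                conn.set(key, value);
                return conn;
            } catch (RuntimeException e) {
                // Anything other than READONLY, or running out of attempts, is rethrown.
                if (!isReadOnlyError(e) || attempt >= maxAttempts) {
                    throw e;
                }
                conn.close();
                conn = factory.get();
            }
        }
    }
}
```

With a real client, `Connection` would wrap whatever connection object the application already uses. As the comment notes, this only helps the application recover after a failover; it does nothing about the cause of the failover.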

We have been in contact with AWS support for the past few weeks and this is what we have found.

Most Redis requests are synchronous, including the flush, so a flush blocks all other requests while it runs. In our case we are flushing 19 million keys, which takes more than 30 seconds.

ElastiCache performs a health check periodically. While the flush is running the health check is blocked, and that causes a failover.

We have been asking the support team how often the health check is performed, so we can get an idea of why our flush only causes a failover 3-4 times a week. The best answer we can get is "We think it's every 30 seconds". However, our flush consistently takes more than 30 seconds and doesn't consistently fail.

They said that they may implement the ability to configure the timing of the health check, but that this would not be done anytime soon.

The best advice they could give us is:

1) Create a completely new cluster for loading the new data on, and instead of flushing the previous cluster, re-point your application(s) to the new cluster, and remove the old one.

2) If the data that you are flushing is an updated version of the same data, consider not flushing at all, and instead updating and overwriting the existing keys.

3) Instead of flushing the data, set the expiry of the items to be when you would normally flush, and let the keys be reclaimed (possibly with a random time to avoid thundering herd issues), and then reload the data.
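As a rough illustration of option 3, the sketch below picks a per-key expiry with random jitter so that keys do not all expire in the same instant. `JitteredTtl` and `ttlSeconds` are hypothetical names, and the 24-hour base and 1-hour jitter in the usage note are assumed values, not figures from AWS support.

```java
import java.util.Random;

/** Hypothetical helper for choosing per-key expiry times with random jitter. */
class JitteredTtl {

    /**
     * Returns baseSeconds plus a random offset between 0 and maxJitterSeconds
     * (inclusive), so keys written together expire at slightly different times.
     */
    static long ttlSeconds(long baseSeconds, long maxJitterSeconds, Random random) {
        if (baseSeconds <= 0 || maxJitterSeconds < 0) {
            throw new IllegalArgumentException(
                "baseSeconds must be > 0 and maxJitterSeconds must be >= 0");
        }
        // nextDouble() is in [0, 1), so the product truncates to 0..maxJitterSeconds.
        long jitter = (long) (random.nextDouble() * (maxJitterSeconds + 1));
        // Guard against floating-point rounding for very large jitter values.
        return baseSeconds + Math.min(jitter, maxJitterSeconds);
    }
}
```

For a nightly reload, something like `JitteredTtl.ttlSeconds(24 * 60 * 60, 60 * 60, new Random())` could be passed as the expiry when each key is written (for example with the Redis `SETEX` command). Combined with option 2, keys that are rewritten every night get a fresh expiry, while keys that are no longer loaded simply age out, so no flush is needed.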

Hope this helps :)

Isadora answered 10/9, 2015 at 6:41 Comment(0)

For Redis versions 6.2 and later, AWS ElastiCache now has a thread monitoring feature, so the health check no longer runs on the same thread as all other Redis actions. Redis can keep processing a long-running command or Lua script and still be considered healthy. Because of this, failovers should happen less often.

Corkhill answered 17/11, 2022 at 8:54 Comment(0)
