JDBC Connection Pool test query "SELECT 1" does not catch AWS RDS Writer/Reader failover
Asked Answered
B

3

13

We are running an AWS RDS Aurora/MySQL database in a cluster with a writer and a reader instance where the writer is replicated to the reader.

The application accessing the database is a standard java application using a HikariCP Connection Pool. The pool is configured to use a "SELECT 1" test query on checkout.

What we noticed is that once in a while RDS fails over the writer to the reader. The failover can also be replicated manually by clicking "Instance Actions/Failover" in the AWS console.

The connection pool is not able to detect the failover and the fact that it is now connected to a reader database, as the "SELECT 1" test queries still succeed. However any subsequent database updates fail with "java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement" errors.

It appears that instead of a "SELECT 1" test query, the Connection Pool can detect that it is now connected to the reader by using a "SELECT count(1) FROM test_table WHERE 1 = 2 FOR UPDATE" test query.

  1. Has anybody experienced the same issue?
  2. Are there any downsides on using "FOR UPDATE" in the test query?
  3. Are there any alternate or better approaches of handling an AWS RDS cluster writer/reader failover?

Your help is much appreciated

Bernie

Benjie answered 3/10, 2018 at 14:15 Comment(0)
C
19

I've been giving this a lot of thought in the two months since my original reply...


How Aurora endpoints work

When you start up an Aurora cluster you get multiple hostnames to access the cluster. For the purposes of this answer, the only two that we care about are the "cluster endpoint," which is read-write, and the "read-only endpoint," which is (you guessed it) read-only. You also have an endpoint for each node within the cluster, but accessing nodes directly defeats the purpose of using Aurora, so I won't mention them again.

For example, if I create a cluster named "example", I'll get the following endpoints:

  • Cluster endpoint: example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
  • Read-only endpoint: example.cluster-ro-x91qlr44xxxz.us-east-1.rds.amazonaws.com

You might think that these endpoints would refer to something like an Elastic Load Balancer, which would be smart enough to redirect traffic on failover, but you'd be wrong. In fact, they're simply DNS CNAME entries with a really short time-to-live:

dig example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com


; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40120
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. IN A

;; ANSWER SECTION:
example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. 5 IN CNAME example.x91qlr44xxxz.us-east-1.rds.amazonaws.com.
example.x91qlr44xxxz.us-east-1.rds.amazonaws.com. 4 IN CNAME ec2-18-209-198-76.compute-1.amazonaws.com.
ec2-18-209-198-76.compute-1.amazonaws.com. 7199 IN A 18.209.198.76

;; Query time: 54 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Dec 14 18:12:08 EST 2018
;; MSG SIZE  rcvd: 178

When a failover happens, the CNAMEs are updated (from example to example-us-east-1a):

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27191
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. IN A

;; ANSWER SECTION:
example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. 5 IN CNAME example-us-east-1a.x91qlr44xxxz.us-east-1.rds.amazonaws.com.
example-us-east-1a.x91qlr44xxxz.us-east-1.rds.amazonaws.com. 4 IN CNAME ec2-3-81-195-23.compute-1.amazonaws.com.
ec2-3-81-195-23.compute-1.amazonaws.com. 7199 IN A 3.81.195.23

;; Query time: 158 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Dec 14 18:15:33 EST 2018
;; MSG SIZE  rcvd: 187

The other thing that happens during a failover is that all of the connections to the "cluster" endpoint get closed, which will fail any in-process transactions (assuming that you've set reasonable query timeouts).

The connections to the "read-only" endpoint don't get closed, which means that whatever node gets promoted will get read-write traffic in addition to read-only traffic (assuming, of course, that your application doesn't just send all requests to the cluster endpoint). Since read-only connections are typically used for relatively expensive queries (eg, reporting), this may cause performance problems for your read-write operations.

The Problem: DNS Caching

When failover happens, all in-process transactions will fail (again, assuming that you've set query timeouts). There will be a short amount of time that any new connections will also fail, as the connection pool attempts to connect to the same host before it's done with recovery. In my experience, failover takes around 15 seconds, during which time your application shouldn't expect to get a connection.

After that 15 seconds (or so), everything should return to normal: your connection pool attempts to connect to the cluster endpoint, it resolves to the IP address of the new read-write node, and all is well. But if anything prevents resolving that chain of CNAMEs, you may find that your connection pool makes connections to a read-only endpoint, which will fail as soon as you try an update operation.

In the case of the OP, he had his own CNAME with a longer timeout. So rather than connect to the cluster endpoint directly, he would connect to something like database.example.com. This is a useful technique in a world where you would manually fail-over to a replica database; I suspect it's less useful with Aurora. Regardless, if you use your own CNAMEs to refer to database endpoints, you need them to have short time-to-live values (certainly no more than 5 seconds).

In my original answer, I also pointed out that Java caches DNS lookups, in some cases forever. The behavior of this cache depends on (I believe) the version of Java, and also whether you're running with a security manager installed. With OpenJDK 8 running as an application, it appears that the JVM will delegate all naming lookups and not cache anything itself. However, you should be familiar with the networkaddress.cache.ttl system property, as described in this Oracle doc and this SO question.

However, even after you've eliminated any unexpected caches, there may still be times where the cluster endpoint is resolved to a read-only node. That leaves the question of how you handle this situation.

Not-so-good solution: use a read-only test on checkout

The OP was hoping to use a database connection test to verify that his application was running on a read-only node. This is surprisingly hard to do: most connection pools (including HikariCP, which is what the OP is using) simply verify that the test query executes successfully; there's no ability to look at what it returns. This means that any test query has to throw an exception to fail.

I haven't been able to come up with a way to make MySQL throw an exception with just a stand-alone query. The best I've come up with is to create a function:

DELIMITER EOF

CREATE FUNCTION throwIfReadOnly() RETURNS INTEGER
BEGIN
    IF @@innodb_read_only THEN
        SIGNAL SQLSTATE 'ERR0R' SET MESSAGE_TEXT = 'database is read_only';
    END IF;
    RETURN 0;
END;
EOF

DELIMITER ;

Then you call that function in your test query:

select throwIfReadOnly() 

This works, mostly. When running my test program I could see a series of "failed to validate connection" messages, but then, inexplicably, the update query would run with a read-only connection. Hikari doesn't have a debug message to indicate which connection it hands out, so I couldn't identify whether it had allegedly passed validation.

But aside from that possible problem, there's a deeper issue with this implementation: it hides the fact that there's a problem. A user makes a request, and maybe waits for 30 seconds to get a response. There's nothing in the log (unless you enable Hikari's debug logging) to give a reason for this delay.

Moreover, while the database is inaccessible Hikari is furiously trying to make connections: in my single-threaded test, it would attempt a new connection every 100 milliseconds. And these are real connections, they simply go to the wrong host. Throw in an app-server with a few dozen or hundred threads, and that could cause a significant ripple effect on the database.

Better solution: use a read-only test on checkout, via a wrapper Datasource

Rather than let Hikari silently retry connections, you could wrap the HikariDataSource in your own DataSource implementation and test/retry yourself. This has the benefit that you can actually look at the results of the test query, which means that you can use a self-contained query rather than calling a separately-installed function. It also lets you log the problem using your preferred log levels, lets you pause between attempts, and gives you a chance to change pool configuration.

private static class WrappedDataSource
implements DataSource
{
    private HikariDataSource delegate;

    public WrappedDataSource(HikariDataSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public Connection getConnection() throws SQLException {
        while (true) {
            Connection cxt = delegate.getConnection();
            try (Statement stmt = cxt.createStatement()) {
                try (ResultSet rslt = stmt.executeQuery("select @@innodb_read_only")) {
                    if (rslt.next() && ! rslt.getBoolean(1)) {
                        return cxt;
                    }
                }
            }
            // evict connection so that we won't get it again
            // should also log here
            delegate.evictConnection(cxt);
            try {
                Thread.sleep(1000);
            }
            catch (InterruptedException ignored) {
                // if we're interrupted we just retry
            }
        }
    }

    // all other methods can just delegate to HikariDataSource

This solution still suffers from the problem that it introduces a delay into user requests. True, you know that it's happening (which you didn't with the on-checkout test), and you could introduce a timeout (limit the number of times through the loop). But it still represents a bad user experience.

The best (imo) solution: switch into "maintenance mode"

Users are incredibly impatient: if it takes more than a few seconds to get a response back, they'll probably try to reload the page, or submit the form again, or do something that doesn't help and may hurt.

So I think the best solution is to fail quickly and let them know that somethng's wrong. Somewhere near the top of the call stack you should already have some code that responds to exceptions. Maybe you just return a generic 500 page now, but you can do a little better: look at the exception, and return a "sorry, temporarily unavailable, try again in a few minutes" page if it's a read-only database exception.

At the same time, you should send a notification to you ops staff: this may be a normal maintance window failover, or it may be something more serious (but don't wake them up unless you have some way of knowing that it's more serious).

Cleanshaven answered 3/10, 2018 at 23:0 Comment(6)
That's a good pointer. I did some more digging and found that we use our own DNS CNAME on top of the RDS cluster url (for flexibility reasons so we can repoint the db without having to bounce the app). The DNS name is configured with the default 300 seconds TTL. I dropped the TTL to 1 second now. I'm not sure what's better though, having a transactional test query or a 1 second DNS TTL. Also I believe that even with 1 second there's a slim chance that the problem still can happen if the app reconnects within the 1 second TTL.Benjie
You normally pick higher TTL values when you expect a lot of requests, in order to minimize load on the nameservers. For what it's worth, AWS has a 5 second TTL on the cluster address (you can use dig ADDRESS to see this).Cleanshaven
I agree with your concern about the race condition that still exists with even a 5 second cache. If your pool supports testing the result of the query you could return the innodb_read_only variable -- but I don't know of any pools that do this. The only alternative, I think, is to use a stored procedure that checks this value and then does a KILL CONNECTION_ID(). If I have time over the next day or so I'll try this out (I see it as being something I could use in the future) and update this answer.Cleanshaven
Interesting approach. I think RDS requires "CALL mysql.rds_kill(thread-ID);" or "CALL mysql.rds_kill_query(thread-ID);" to kill connections for what it's worth.Benjie
I've only known pools to use the test query to verify connectivity, not check that something was returned. I checked the source and Hikari is the same way. Too bad, because that would have been a trivial check. Perhaps raise an issue with the project? It seems like it would be a really nice addition.Cleanshaven
I came across this article which suggests the following SQL statement, to check if the connection pertains to a read-only endpoint: Select case when @@read_only + @@innodb_read_only = 0 then 1 else (select table_name from information_schema.tables) end as `1` The idea is to cause an error because it's unexpected that more than one row is returned.Sturtevant
P
2

set connection pool idle connection timeout in your java code datasource. set around 1000ms

Photo answered 3/10, 2018 at 21:50 Comment(3)
You got a point there! I checked hikari but unfortunately "The minimum allowed value is 10000ms (10 seconds). Default: 600000 (10 minutes)". See github.com/brettwooldridge/HikariCP#configuration-knobs-babyBenjie
10 sec is fine, if you want lesser then you can use ComboPooledDataSource. Also if you use mySql driver I will recommend use MariaDb driver, it works better with AWS Aurora MysqlPhoto
Thanks for the tips!Benjie
J
1

Aurora failover

As Sayantan Mandal hints in his comments. When using Aurora just use the MariaDb driver it has support for failover.

It is documented here: https://aws.amazon.com/blogs/database/using-the-mariadb-jdbc-driver-with-amazon-aurora-with-mysql-compatibility/

And here: https://mariadb.com/kb/en/failover-and-high-availability-with-mariadb-connector-j/#aurora-endpoints-and-discovery

Your connection string will start with jdbc:mariadb:aurora// or jdbc:mysql:aurora//.

The connection pool normally calls JDBC4Connection#isValid which should correctly return false with this driver when on a read only replica.

No custom code required.

DNS Caching

As for DNS caching (networkaddress.cache.ttl) depending on your JVM the default is 30 or 60s depening of whether a security manager is present.

You can retrieve the value at runtime with this snippet if unsure:

Class.forName("sun.net.InetAddressCachePolicy").getMethod("get").invoke(null)

With 30s DNS cache, your connection will start to arrive at the read-write replica at most 30s after the failover happens.

Jeanette answered 20/10, 2020 at 7:41 Comment(1)
The maria db driver will not be able to cope with this specific situation as the issue is caused by the DNS TTL of the AWS aurora DNS entry which doesn't refresh fast enough under db load from the app. Also while java has its own DNS cache on top of the OS DNS cache, the OS DNS cache is controlled by AWS's Route 53 entry which you can't change. Best solution IMO is still a test query that checks if the db is in read only mode.Benjie

© 2022 - 2024 — McMap. All rights reserved.