How to resolve "Connection Reset" when using Java Apache HttpClient 4.5.12

We have been discussing with one of our data providers an issue where some of our HTTP requests intermittently fail with "Connection reset" exceptions; we have also seen "The target server failed to respond" exceptions.

Many Stack Overflow posts point to potential solutions, and I'm hoping this question will help me get to the bottom of the root cause.

Context

It's a Java web application hosted on AWS Elastic Beanstalk, scaling between 2 and 4 servers based on load. The Java WAR file uses HttpClient 4.5.12 to communicate with the provider. Over the last few months we have seen:

45 x Connection Reset (only 3 were timeouts over 30s, the others failed within 20ms)

To put this into context, we make in the region of 10,000 requests to this supplier, so the error rate (roughly 0.45%) isn't excessive, but it is very inconvenient because our customers pay for a service that then subsequently fails.

Right now we are trying to focus on eliminating the "Connection reset" scenarios, and we have been advised to try the following:

1) Restart our app servers (a desperate just-in-case scenario)

2) Change the DNS servers to use Google 8.8.8.8 & 8.8.4.4 (so our requests take a different path)

3) Assign a static IP to each server (so they can enable us to communicate without going through their CloudFront distribution)

We will work through those suggestions, but at the same time I want to understand where our HttpClient implementation might not be quite right.

Typical usage

User Request --> Our server (JAX-RS request) --> HttpClient to 3rd party --> Response received e.g. JSON/XML --> Massaged response is sent back (Our JSON format)

Technical details

Tomcat 8 with Java 8 running on 64bit Amazon Linux

HttpClient 4.5.12
HttpCore 4.4.13 (the Maven dependency tree shows HttpClient 4.5.12 requires HttpCore 4.4.13)
HttpMime 4.5.12

Typically an HTTP request will take anywhere between 200ms and 10 seconds, with timeouts set at around 15-30s depending on the API we are invoking. I also use a connection pool, and given that most requests should complete within 30 seconds, I felt it was safe to evict anything older than double that period.

Any advice on whether these are sensible values is appreciated.

// max 200 requests in the connection pool
private static final int CONNECTIONS_MAX = 200;

// each 3rd-party API can only use up to 50, so worst case 4 APIs can be flooded before the pool is exhausted
private static final int CONNECTIONS_MAX_PER_ROUTE = 50;

// as our timeouts are typically 30s I'm assuming it's safe to clean up connections
// that are double that

// Connection timeouts are 30s; wasn't sure whether to close at 31s or wait 2 x typical = 60s
private static final int CONNECTION_CLOSE_IDLE_MS = 60000;

// If the connection hasn't been used for 60s then we aren't busy and we can remove it from the connection pool
private static final int CONNECTION_EVICT_IDLE_MS = 60000;

// Is this per request or per packet? Either way, all requests should finish within 30s
private static final int CONNECTION_TIME_TO_LIVE_MS = 60000;

// To ensure connections are validated if they've sat in the pool unused for at least 500ms
private static final int CONNECTION_VALIDATE_AFTER_INACTIVITY_MS = 500; // WAS 30000 (not tested 500ms yet)

Additionally we tend to set the three timeouts to 30s, but I'm sure we can fine-tune these...

// The client tries to connect to the server. This is the time allowed for the connection to be established
// (i.e. the time to establish a connection with the remote host)
.setConnectTimeout(...) // typically 30s - I guess this could be 5s (if we can't connect by then the remote server is stuffed/busy)

// Used when requesting a connection from the connection manager (pooling)
// i.e. the time to fetch a connection from the connection pool
.setConnectionRequestTimeout(...) // typically 30s - I guess this only applies if the pool is saturated, in which case it's how long to wait for a connection?

// After establishing the connection, the client socket waits for a response after sending the request.
// This is the maximum period of inactivity allowed between data packets
.setSocketTimeout(...) // typically 30s - I believe this is the main one we care about: if we don't get our payload within 30s then give up
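
These three values don't have to be repeated on every request. A minimal sketch (not what we currently run; the 5s connect timeout is just an assumption) of setting them once as client-wide defaults via setDefaultRequestConfig, which a per-request RequestConfig would still override:

    // Sketch only: client-wide timeout defaults instead of per-request configuration
    RequestConfig defaultConfig = RequestConfig.custom()
            .setConnectTimeout(5000)             // time to establish the TCP connection (assumed 5s)
            .setConnectionRequestTimeout(30000)  // time to obtain a connection from the pool
            .setSocketTimeout(30000)             // max inactivity between data packets
            .build();

    CloseableHttpClient client = HttpClientBuilder.create()
            .setConnectionManager(cm)                // the same pooling connection manager described below
            .setDefaultRequestConfig(defaultConfig)  // any per-request RequestConfig overrides this
            .build();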

I have copied and pasted the main code we use for all GET/POST requests, but stripped out the unimportant aspects such as our retry logic, pre-caching and post-caching.

We are using a single PoolingHttpClientConnectionManager with a single CloseableHttpClient; they're both configured as follows...

    private static PoolingHttpClientConnectionManager createConnectionManager() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();

        cm.setMaxTotal(CONNECTIONS_MAX); // 200
        cm.setDefaultMaxPerRoute(CONNECTIONS_MAX_PER_ROUTE); // 50
        cm.setValidateAfterInactivity(CONNECTION_VALIDATE_AFTER_INACTIVITY_MS); // Was 30000 now 500

        return cm;
    }
    private static CloseableHttpClient createHttpClient() {

        httpClient = HttpClientBuilder.create()
                .setConnectionManager(cm)
                .disableAutomaticRetries() // our code does the retries
                .evictIdleConnections(CONNECTION_EVICT_IDLE_MS, TimeUnit.MILLISECONDS) // 60000
                .setConnectionTimeToLive(CONNECTION_TIME_TO_LIVE_MS, TimeUnit.MILLISECONDS) // 60000
                .setRedirectStrategy(LaxRedirectStrategy.INSTANCE)
                // .setKeepAliveStrategy() - The default implementation looks solely at the 'Keep-Alive' header's timeout token.
                .build();
        return httpClient;
    }

Every minute I have a thread that tries to reap connections

    public static PoolStats performIdleConnectionReaper(Object source) {
        synchronized (source) {
            final PoolStats totalStats = cm.getTotalStats();
            Log.info(source, "max:" + totalStats.getMax() + " avail:" + totalStats.getAvailable() + " leased:" + totalStats.getLeased() + " pending:" + totalStats.getPending());
            cm.closeExpiredConnections();
            cm.closeIdleConnections(CONNECTION_CLOSE_IDLE_MS, TimeUnit.MILLISECONDS); // 60000
            return totalStats;
        }
    }
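
The scheduling itself is nothing special; it could look something like this sketch (HttpHelper is a placeholder name for our utility class):

    // Sketch: run the reaper above once a minute on a single scheduled thread
    private static final ScheduledExecutorService REAPER_SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    static {
        REAPER_SCHEDULER.scheduleAtFixedRate(
                () -> performIdleConnectionReaper(HttpHelper.class), // HttpHelper = placeholder class name
                1, 1, TimeUnit.MINUTES);
    }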

This is the custom method that performs all HttpClient GET/POST requests. It also does stats, pre-caching, post-caching and other useful stuff, but I've stripped all of that out; what remains is the typical outline for each request. I've tried to follow the pattern in the HttpClient docs, which says to consume the entity and close the response. Note that I don't close the httpClient, because a single instance is used for all requests.

    public static HttpHelperResponse execute(HttpHelperParams params) {

        // 'ret' holds the accumulated result; a no-arg constructor is assumed here -
        // the real initialisation was stripped out along with the caching/stats code
        HttpHelperResponse ret = new HttpHelperResponse();

        boolean abortRetries = false;

        while (!abortRetries && ret.getAttempts() <= params.getMaxRetries()) {

            // 1 Create HttpClient
            // This is done once in the static init CloseableHttpClient httpClient = createHttpClient(params);

            // 2 Create one of the methods, e.g. HttpGet / HttpPost - Note this also adds HTTP headers 
            // (see separate method below)
            HttpRequestBase request = createRequest(params);

            // 3 Tell HTTP Client to execute the command
            CloseableHttpResponse response = null;
            HttpEntity entity = null;
            boolean alreadyStreamed = false;

            try {

                response = httpClient.execute(request);
                if (response == null) {
                    throw new Exception("Null response received");
                } else {

                    final StatusLine statusLine = response.getStatusLine();
                    ret.setStatusCode(statusLine.getStatusCode());
                    ret.setReasonPhrase(statusLine.getReasonPhrase());

                    if (ret.getStatusCode() == 429) {
                        try {
                            final int delay = (int) (Math.random() * params.getRetryDelayMs());
                            Thread.sleep(500 + delay); // minimum 500ms + random amount up to delay specified
                        } catch (Exception e) {
                            Log.error(false, params.getSource(), "HttpHelper Rate-limit sleep exception", e, params);
                        }
                    } else {

                        // 4 Read the response
                        // 6 Deal with the response
                        // do something useful with the response body                        
                        entity = response.getEntity();

                        if (entity == null) {
                            throw new Exception("Null entity received");
                        } else {
                            ret.setRawResponseAsString(EntityUtils.toString(entity, params.getEncoding()));
                            ret.setSuccess();
                            if (response.getAllHeaders() != null) {
                                for (Header header : response.getAllHeaders()) {
                                    ret.addResponseHeader(header.getName(), header.getValue());
                                }
                            }
                        }

                    }
                }

            } catch (Exception ex) {

                if (ret.getAttempts() >= params.getMaxRetries()) {
                    Log.error(false, params.getSource(), ex);
                } else {
                    Log.warn(params.getSource(), ex.getMessage());
                }

                ret.setError(ex); // If we subsequently get a response then the error will be cleared.                
            } finally {

                ret.incrementAttempts();

                // Any HTTP 2xx is considered successful, so stop retrying; also stop if
                // a specific HTTP status code has been passed in as one we should not retry on
                if (ret.getStatusCode() >= 200 && ret.getStatusCode() <= 299) {
                    abortRetries = true;
                } else if (params.getDoNotRetryStatusCodes().contains(ret.getStatusCode())) {
                    abortRetries = true;
                }

                if (entity != null) {
                    try {
                        // and ensure it is fully consumed - hand it back to the pool
                        EntityUtils.consume(entity);
                    } catch (IOException ex) {
                        Log.error(false, params.getSource(), "HttpHelper Was unable to consume entity", params);
                    }

                }

                if (response != null) {
                    try {
                        // The underlying HTTP connection is still held by the response object
                        // to allow the response content to be streamed directly from the network socket.
                        // In order to ensure correct deallocation of system resources
                        // the user MUST call CloseableHttpResponse#close() from a finally clause.
                        // Please note that if response content is not fully consumed the underlying
                        // connection cannot be safely re-used and will be shut down and discarded
                        // by the connection manager.                     
                        response.close();
                    } catch (IOException ex) {
                        Log.error(false, params.getSource(), "HttpHelper Was unable to close a response", params);
                    }
                }

                // When using connection pooling we don't want to close the client, otherwise the connection
                // pool will also be closed
                //                if (httpClient != null) {
                //                    try {
                //                        httpClient.close();
                //                    } catch (IOException ex) {
                //                        Log.error(false, params.getSource(), "HttpHelper Was unable to close httpClient", params);
                //                    }
                //                }


            }
        }

        return ret;
    }
    private static HttpRequestBase createRequest(HttpHelperParams params) {

        ...
        request.setConfig(RequestConfig.copy(RequestConfig.DEFAULT)
            // The client tries to connect to the server. This is the time allowed for the connection to be established
            // (i.e. the time to establish a connection with the remote host)
            .setConnectTimeout(...) // typical 30s

            // Used when requesting a connection from the connection manager (pooling)
            // The time to fetch a connection from the connection pool
            .setConnectionRequestTimeout(...) // typical 30s

            // After establishing the connection, the client socket waits for a response after sending the request.
            // This is the maximum period of inactivity allowed between data packets
            .setSocketTimeout(...) // typical 30s

            .build()
        );

        return request;
    }
Hendren answered 5/6, 2020 at 16:24 Comment(6)
Send a message with a wire / context log of the session to [email protected]. By the way, Connection reset is pretty much always a server-side issue, not a client-side one.Ical
Thanks @Ical - a wire/context log? as in low-level EC2 terminal tcpdump/other, or is there something in HttpClient I can turn on for debugging?Hendren
Please see hc.apache.org/httpcomponents-client-4.5.x/logging.htmlIcal
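
For anyone else who needs the wire/context log mentioned above, my reading of the linked page is that it boils down to enabling DEBUG on the org.apache.http and org.apache.http.wire loggers. A sketch using commons-logging's SimpleLog via system properties (these must be set before any HttpClient classes are loaded, e.g. first thing in main(); otherwise use the log4j/logback configuration from the docs):

    // Sketch: roughly equivalent to the -D JVM flags shown on the HttpClient logging page
    System.setProperty("org.apache.commons.logging.Log",
            "org.apache.commons.logging.impl.SimpleLog");
    System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
    System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http", "DEBUG");
    System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire", "DEBUG");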
I have a similar (maybe the same) problem with httpclient 4.5.11; I don't have a solution yet though... For now I'm making the request in a loop until it works (max 4 times), but I would like to know of a solution.Epenthesis
Hi @LucasBasquerotto, one of our suppliers has confirmed that they were able to add sticky sessions to their load balancer; since doing this we have had no connection reset errors from that specific environment. It would appear that connection pooling with settings to keep the connection open might be causing our issues. I'm not yet sure which settings to tweak to isolate this completely, but I did drop CONNECTION_EVICT_IDLE_MS from 60000 to 500ms and that helped a bit - I think the ultimate solution is to turn off keeping connections open between requests (see the sketch below).Hendren
We are experiencing the same issue, which gets resolved if we use SimpleClientHttpRequestFactory so seems to be an issue with the pool somehow but not able to find the reason. Please do let know if you found some more information or a resolution.Piste
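
Following up on my earlier comment about turning off keep-alive between requests: a sketch of what I believe that would look like (NoConnectionReuseStrategy forces a fresh connection per request, trading a TCP/TLS handshake per call for robustness against connections the server has already dropped). This is not what we currently run:

    // Sketch only: disable HTTP keep-alive so every request opens a new connection
    httpClient = HttpClientBuilder.create()
            .setConnectionManager(cm)
            .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE) // never reuse connections
            .disableAutomaticRetries()
            .setRedirectStrategy(LaxRedirectStrategy.INSTANCE)
            .build();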
