How to get around "connection reset by peer" when using Elasticsearch's RestClient

We are using Hibernate Search 5.10.3.Final against an Elasticsearch 5.6.6 server.

The connection between our app and ES seems solid when issuing FullTextQueries directly, perhaps because Hibernate Search has some built-in retry mechanism (I'm not sure). However, our app also uses Elasticsearch's RestClient to issue a direct call to _analyze, and this is where we get a "connection reset by peer" IOException when our firewall closes idle connections after 30 minutes.

java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_131]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_131]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_131]
    at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:1.8.0_131]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_131]
    at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:204) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]

For completeness, here is most of our RestClient code:

// Unwrap down to the low-level Elasticsearch RestClient managed by Hibernate Search
SearchFactory searchFactory = fts.getSearchFactory();
IndexFamily indexFamily = searchFactory.getIndexFamily(ElasticsearchIndexFamilyType.get());
ElasticsearchIndexFamily elasticsearchIndexFamily = indexFamily.unwrap(ElasticsearchIndexFamily.class);
RestClient restClient = elasticsearchIndexFamily.getClient(RestClient.class);

// Request body for _analyze: { "analyzer": ..., "text": ... }
Map<String, String> rawData = new HashMap<>();
rawData.put("analyzer", analyzer);
rawData.put("text", text);

try {
    String jsonData = objectMapper.writeValueAsString(rawData);
    HttpEntity entity = new NStringEntity(jsonData, ContentType.APPLICATION_JSON);

    // Direct call to the _analyze endpoint of the "vendor" index
    Response response = restClient.performRequest("GET", "vendor/_analyze", Collections.emptyMap(), entity);

    int statusCode = response.getStatusLine().getStatusCode();
    if (statusCode == HttpStatus.SC_OK) {
        // we parse the response here
    }
} catch (IOException e) {
    String message = "Error communicating with Elasticsearch!";
    logger.error(message, e);
    throw new IllegalStateException(message, e);
}

We tried creating a 'heartbeat' that issues a small '_cluster/health' call through the RestClient every minute, but that doesn't seem to completely solve the issue either: even the heartbeat fails with the same IOException on occasion.
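
A minimal sketch of that heartbeat (the scheduling code is reconstructed here; only the _cluster/health call and the one-minute interval are from our actual setup):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ElasticsearchHeartbeat {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final RestClient restClient;

    public ElasticsearchHeartbeat(RestClient restClient) {
        this.restClient = restClient;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Cheap request whose only purpose is to generate traffic.
                Response response = restClient.performRequest("GET", "/_cluster/health");
                response.getStatusLine(); // discarded; we only care that the call succeeded
            } catch (Exception e) {
                // The ping itself can hit a connection the firewall already reset,
                // which matches the occasional IOException we saw.
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}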

  1. Can someone explain the number of connections between HibernateSearch and ES (I thought it defaulted to 20 or 2 depending on ES clustered or not) and if the connections are used in round robin or random order?
  2. Will a simple retry of the RestClient call 'wake' the connection up again?
  3. Or do we need to manually reconnect the connection to ES and if so, how?
  4. Lastly, is there an existing hibernate search setting that would solve this, possibly hibernate.search.default.elasticsearch.discovery.enabled or another?
Trillion answered 25/10, 2018 at 20:34 Comment(1)
I can see how the heartbeat ping doesn't generate a wakeup, because it establishes a different connection than the persistent one that is already in place. You'd need to run the ping call on that existing connection?Uttermost

Explanation of the problem

I'll assume that your explanation of the connection being closed by your firewall after 30 minutes is the correct one.

From what I can see, the Apache HTTP client decides how long to keep a given connection alive based on a ConnectionKeepAliveStrategy. By default, this is org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy, and this will keep connections alive for as long as recommended by the Keep-Alive header in responses from the Elasticsearch server, or indefinitely if the Elasticsearch server does not return such a header in responses.
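
For illustration, here is a small self-contained sketch of that default behaviour, exercising the Apache HttpClient classes directly (the responses are built by hand here, not taken from Elasticsearch):

import org.apache.http.HttpResponse;
import org.apache.http.HttpVersion;
import org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy;
import org.apache.http.message.BasicHttpResponse;
import org.apache.http.protocol.BasicHttpContext;

public class KeepAliveDemo {
    public static void main(String[] args) {
        DefaultConnectionKeepAliveStrategy strategy = DefaultConnectionKeepAliveStrategy.INSTANCE;
        BasicHttpContext context = new BasicHttpContext();

        // No Keep-Alive header: the strategy returns -1, i.e. "reuse this
        // connection indefinitely" -- which is what lets a firewall kill it silently.
        HttpResponse noHeader = new BasicHttpResponse(HttpVersion.HTTP_1_1, 200, "OK");
        System.out.println(strategy.getKeepAliveDuration(noHeader, context)); // -1

        // With a Keep-Alive header, the advertised timeout (in seconds) is honoured.
        HttpResponse withHeader = new BasicHttpResponse(HttpVersion.HTTP_1_1, 200, "OK");
        withHeader.addHeader("Keep-Alive", "timeout=5, max=100");
        System.out.println(strategy.getKeepAliveDuration(withHeader, context)); // 5000 (ms)
    }
}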

I did a few tests, and apparently Elasticsearch does not return any Keep-Alive header, so currently, connections are re-used indefinitely, at least until your network kills them.

Once a connection is killed, you could hope that automatic retries step in, but they are only effective if you have more than one Elasticsearch node: if you have only one node and a request fails, the REST client won't retry on the same node.
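
To make that concrete: failover only exists when the low-level REST client knows several hosts. A minimal sketch (the hostnames are hypothetical; with Hibernate Search you would list the hosts in hibernate.search.default.elasticsearch.host rather than building the client yourself):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class MultiNodeClientFactory {

    public static RestClient create() {
        // With several hosts, a request that fails on one host is retried on
        // the next; with a single host there is nowhere to fail over to.
        return RestClient.builder(
                new HttpHost("es-node-1", 9200, "http"),  // hypothetical hostnames
                new HttpHost("es-node-2", 9200, "http"))
                .build();
    }
}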

So, all in all, the failures are expected. What is not expected is that you only witnessed the failures with your own client code; I guess you may have overlooked some errors in the logs?

Solution (hopefully)

Maybe the Apache HTTP client can handle re-opening connections automatically when they are forcefully closed, but I couldn't find such a feature.

I couldn't find a way to make the Elasticsearch server add a Keep-Alive header to its HTTP responses, either.

If you use HTTP, and not HTTPS (in which case I hope it's a private network), you may be able to configure your network infrastructure to insert such headers into every HTTP response. If you run Elasticsearch behind a proxy, such as an Apache server, you should be able to do so there as well.

Otherwise, in order to configure it explicitly on the client side, you can use the org.hibernate.search.elasticsearch.client.spi.ElasticsearchHttpClientConfigurer extension point in Hibernate Search.

WARNING: this extension point is an SPI and, on top of that, it is experimental, which means it may change in incompatible ways in any newer version of Hibernate Search. You may have to change your code on your next upgrade, even for a micro upgrade. No guarantee on our side.

Create an implementation:

package com.acme.config;

import java.util.Properties;

import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
import org.hibernate.search.elasticsearch.client.spi.ElasticsearchHttpClientConfigurer;

public class MyHttpConfigurer implements ElasticsearchHttpClientConfigurer {

    // Expire pooled connections after 20 minutes, safely below the
    // firewall's 30-minute idle timeout.
    private static final int KEEP_ALIVE_MS = 20 * 60 * 1000;

    @Override
    public void configure(HttpAsyncClientBuilder builder, Properties properties) {
        builder.setKeepAliveStrategy( (response, context) -> KEEP_ALIVE_MS );
    }
}

Register your implementation by creating a META-INF/services/org.hibernate.search.elasticsearch.client.spi.ElasticsearchHttpClientConfigurer file with this content:

com.acme.config.MyHttpConfigurer

... and you're done.

Start your application once in debug mode with a breakpoint in MyHttpConfigurer to check that it's executed. If it is, the HTTP client should automatically stop using idle connections after 20 minutes, and you shouldn't experience the same problem again.

To answer your questions

  1. Can someone explain the number of connections between HibernateSearch and ES (I thought it defaulted to 20 or 2 depending on ES clustered or not) and if the connections are used in round robin or random order?

From the documentation:

hibernate.search.default.elasticsearch.max_total_connection 20 (default)

hibernate.search.default.elasticsearch.max_total_connection_per_route 2 (default)

It does not depend on whether ES is clustered or not; it depends on how many nodes/routes the client knows of. If automatic discovery is disabled (hibernate.search.default.elasticsearch.discovery.enabled set to false, the default), the nodes known to the client are the ones you configured explicitly. If it's enabled and there is more than one node in the cluster, the client may know of more nodes than you configured explicitly.

By default, you'll use at most two connections per host known to your client, but never more than 20 connections total. So if 9 nodes are known, you'll use at most 18 connections, if 10 nodes are known, you'll use at most 20 connections, and if 11 or more nodes are known, you'll still use at most 20 connections.
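
For illustration, a sketch of how these two settings could be assembled programmatically (the property names come from the documentation quoted above; returning them as a settings map is an assumption about your bootstrap, as they would usually live in persistence.xml or hibernate.properties):

import java.util.HashMap;
import java.util.Map;

public class ConnectionSettings {

    public static Map<String, Object> connectionSettings() {
        Map<String, Object> settings = new HashMap<>();
        settings.put("hibernate.search.default.elasticsearch.max_total_connection", "20");
        settings.put("hibernate.search.default.elasticsearch.max_total_connection_per_route", "2");
        return settings;
    }
}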

  2. Will a simple retry of the RestClient call 'wake' the connection up again?

As far as I know, it should; but then I don't know what exactly resets your connection, so it's hard to tell.
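
If you do experiment with a retry, a minimal wrapper around performRequest might look like the sketch below (the single-retry policy and the RetryingCalls/performWithOneRetry names are assumptions for illustration, not an existing Hibernate Search or Elasticsearch API):

import java.io.IOException;
import java.util.Map;

import org.apache.http.HttpEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public final class RetryingCalls {

    private RetryingCalls() {
    }

    // Retry once on IOException: the first attempt may die on a stale pooled
    // connection, and the second attempt should go out on a fresh one.
    public static Response performWithOneRetry(RestClient client, String method, String endpoint,
            Map<String, String> params, HttpEntity entity) throws IOException {
        try {
            return client.performRequest(method, endpoint, params, entity);
        } catch (IOException firstFailure) {
            return client.performRequest(method, endpoint, params, entity);
        }
    }
}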

  3. Or do we need to manually reconnect the connection to ES and if so, how?

I don't think you should do that yourself. Connections are managed automatically at a very low level: not by Hibernate Search, not even by the REST client, but by the HTTP client.

Anyway, if you really want to go that way, you'll have to get your hands on the HTTP client somehow. I don't know how.

  4. Lastly, is there an existing hibernate search setting that would solve this, possibly hibernate.search.default.elasticsearch.discovery.enabled or another?

hibernate.search.default.elasticsearch.discovery.enabled will only help if you need more connections and your Elasticsearch is clustered. In your case, it seems your existing connections are killed off after a certain time, so even if you increase the number of connections, you'll still experience the same problem.

Hulton answered 26/10, 2018 at 7:29 Comment(2)
Thanks Yoann. My plan is to try some RestClient performRequest retry code first, as I feel this is the simplest solution mentioned. In addition, it was suggested to set the Linux keepalive (tcp_keepalive_time) to under 30 minutes on our ES servers so the firewall doesn't close the connections.Trillion
Update: we tested both the retry code and setting the keepalive to 900 (15 minutes), and both worked. We are initially implementing both solutions, but it seems the keepalive would work fine all by itself.Trillion
