Spring Boot random "SSLException: Connection reset" in Kubernetes with JDK11

Context:

  • We have a Spring Boot (2.3.1.RELEASE) web app
  • It's written in Java 8 but runs inside a container with Java 11 (openjdk:11.0.6-jre-stretch).
  • It has a DB connection and calls an upstream service via HTTPS (a simple RestTemplate#exchange call) (this is important!)
  • It is deployed inside of a Kubernetes cluster (not sure if this is important)

Problem:

  • Every day, I see a small percentage of requests towards the upstream service fail with this error: I/O error on GET request for "https://upstream.xyz/path": Connection reset; nested exception is javax.net.ssl.SSLException: Connection reset
  • The errors are completely random and intermittent.
  • We previously had a similar error (javax.net.ssl.SSLProtocolException: Connection reset) that was related to JRE 11 and its TLS 1.3 negotiation issue. Updating our Docker image to the one mentioned above fixed that.
  • This is the stack trace from the error:
java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(Unknown Source)
    at java.base/java.net.SocketInputStream.read(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketInputRecord.read(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown Source)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:87)
    at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
    at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
    at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:739)
    at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:674)
    at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:583)
....

Configuration:

public static RestTemplate create(final int maxTotal, final int defaultMaxPerRoute,
                                  final int connectTimeout, final int readTimeout,
                                  final String userAgent) {
    final Registry<ConnectionSocketFactory> schemeRegistry = RegistryBuilder.<ConnectionSocketFactory>create()
            .register("http", PlainConnectionSocketFactory.getSocketFactory())
            .register("https", SSLConnectionSocketFactory.getSocketFactory())
            .build();

    final PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(schemeRegistry);
    connManager.setMaxTotal(maxTotal);
    connManager.setDefaultMaxPerRoute(defaultMaxPerRoute);

    final CloseableHttpClient httpClient = HttpClients.custom()
            .setConnectionManager(connManager)
            .setUserAgent(userAgent)
            .setDefaultRequestConfig(RequestConfig.custom()
                                             .setConnectTimeout(connectTimeout)
                                             .setSocketTimeout(readTimeout)
                                             .setExpectContinueEnabled(false).build())
            .build();

    return new RestTemplateBuilder()
            .requestFactory(() -> new HttpComponentsClientHttpRequestFactory(httpClient))
            .build();
}

Has anyone experienced this issue? When I turn on debug logs on the http client, it is overflowing with noise and I am unable to discern anything useful...

Humdinger answered 12/11, 2020 at 19:41 Comment(8)
Do you have a way to contact the people maintaining the upstream server? Perhaps the upstream server is load balancing your requests between a pool of servers, one of which is misconfigured. Or perhaps the server on the other side got rebooted.Fundamentalism
In the meanwhile, you could add some retry logic, and see if a second and third attempt fail as well.Fundamentalism
Hi Urosh, were you able to fix this issue? I'm facing the exact same problem. I suspected it was an incorrect TLS version, but that isn't the issue either.Lally
Hi, I am still testing the method mentioned in this answer: https://mcmap.net/q/726128/-spring-boot-random-quot-sslexception-connection-reset-quot-in-kubernetes-with-jdk11 and the errors have reduced but not completely gone so I am trying to tweak the timeout config and see what else can be done. When I am sure the solution is correct, I will either accept that answer or post an answer myself that has resolved the issue.Humdinger
Hi @UroshT. Did you find anything? I am facing a similar issue; I have observed that it started happening after moving to Docker.Roughspoken
Nope, there are things that have lowered the amount of errors but nothing that has fixed the root causeHumdinger
Hi @UroshT. any luck with this issue? I noticed we had an elevated error rate when we removed the Linkerd service mesh from our cluster. This issue also seems very particular to Java services.Selfsealing
Still no fix and not really working on it actively since it only impacts like 0.2% of requests. But it is annoying, more than anything. Will post here if I fix itHumdinger

We had a similar problem when migrating to AWS/Kubernetes. I've found out why.

You're using a connection pool. The default behavior of the PoolingHttpClientConnectionManager is that it will reuse connections. So connections will not be closed immediately when your request is done. This will save resources by not having to reconnect all the time.

A Kubernetes cluster uses NAT (Network Address Translation) for outgoing connections. When a connection is not used for a certain amount of time, it is removed from the NAT table and the connection breaks. This causes the seemingly random SSLExceptions.

On AWS, connections are removed from the NAT table when they have been idle for 350 seconds. Other Kubernetes environments may have different settings.

See https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html

The solution:

Disable connection-reuse:

final CloseableHttpClient closeableHttpClient = HttpClients.custom()
    .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
    .setConnectionManager(poolingHttpClientConnectionManager)
    .build();

Or, let the httpClient evict connections that are idle for too long:

return HttpClients.custom()
            .evictIdleConnections(300, TimeUnit.SECONDS)  //Read the javadocs, may not be used when the instance of HttpClient is created inside an EJB container.
            .setConnectionManager(poolingHttpClientConnectionManager)
            .build();
        

Or call setKeepAliveStrategy(...) on the HttpClientBuilder with a custom ConnectionKeepAliveStrategy that never returns -1 or a timeout of more than 300 seconds.
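
For example, a minimal sketch of that third option, assuming Apache HttpClient 4.x where the builder method is setKeepAliveStrategy; the 300-second cap and the class/variable names are illustrative, not from the question:

import java.util.concurrent.TimeUnit;

import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.conn.HttpClientConnectionManager;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy;
import org.apache.http.impl.client.HttpClients;

public final class CappedKeepAliveHttpClient {

    // Cap keep-alive at 300 seconds so a pooled connection is never reused
    // after the NAT gateway may already have dropped it.
    private static final long CAP_MILLIS = TimeUnit.SECONDS.toMillis(300);

    public static CloseableHttpClient create(final HttpClientConnectionManager connManager) {
        final ConnectionKeepAliveStrategy keepAlive = (response, context) -> {
            // Respect the server's Keep-Alive header when it is shorter; -1 means "keep forever".
            final long serverKeepAlive = DefaultConnectionKeepAliveStrategy.INSTANCE
                    .getKeepAliveDuration(response, context);
            return (serverKeepAlive < 0 || serverKeepAlive > CAP_MILLIS) ? CAP_MILLIS : serverKeepAlive;
        };
        return HttpClients.custom()
                .setKeepAliveStrategy(keepAlive)
                .setConnectionManager(connManager)
                .build();
    }
}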

Histrionic answered 24/8, 2021 at 14:51 Comment(4)
Have you noticed that this new connection config impacts performance in any way?Humdinger
We are not in production yet. Disabling connection-reuse will have an impact if you do a lot of requests. But I think the second and third options will not have a significant impact: when your application makes a lot of calls, the connections will never be idle, so this change does not change anything. When your application does not make a lot of calls, it will now have to create a new connection every 5 minutes (worst case). That is not that much.Histrionic
So does this mean that if the NAT gateway times out, the connection is not returned to the pool, but if we do .evictIdleConnections(300, TimeUnit.SECONDS) or setKeepAliveStrategy(...) with a custom ConnectionKeepAliveStrategy that never returns -1 or a timeout of more than 300 seconds, then it will be returned to the pool in a reusable state?Smack
@Smack no, the idle connections will be evicted from the pool, but that is what we want. The bug is that connections that are idle for longer than 300 seconds will break, so our fix is to close and remove long-idle connections from the pool. The pool will create new connections if required.Histrionic

I will share my experience with this error; judging by the stack trace, it is probably the same problem you are facing.

The fact that it happens randomly is the key clue that makes me suspect it is the same problem.

HTTP connections are made through an HTTP client library (Apache HttpClient).

The HTTP client usually manages a reusable pool of connections. This pool has a limit. In our case, the pool of connections sometimes (randomly) becomes fully occupied, and there are no free connections left to use.

  1. You can either increase the pool size, or
  2. Implement a back-off retry mechanism that retries the HTTP request (grabbing a connection from the pool again) when it fails; see the sketch below.
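
A minimal sketch of both options; the pool sizes, URL handling, and retry counts are illustrative, not values from the question:

import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.ResourceAccessException;
import org.springframework.web.client.RestTemplate;

public final class UpstreamClient {

    // Option 1: a larger pool so concurrent requests do not exhaust it.
    public static RestTemplate restTemplate() {
        final PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager();
        connManager.setMaxTotal(200);            // illustrative sizes
        connManager.setDefaultMaxPerRoute(50);
        return new RestTemplate(new HttpComponentsClientHttpRequestFactory(
                HttpClients.custom().setConnectionManager(connManager).build()));
    }

    // Option 2: a simple back-off retry around the call; a real implementation
    // might use Spring Retry instead of a hand-rolled loop.
    public static ResponseEntity<String> getWithRetry(final RestTemplate restTemplate, final String url)
            throws InterruptedException {
        ResourceAccessException last = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                return restTemplate.exchange(url, HttpMethod.GET, null, String.class);
            } catch (final ResourceAccessException e) {   // wraps the "Connection reset" SSLException
                last = e;
                Thread.sleep(attempt * 500L);             // linear back-off between attempts
            }
        }
        throw last;
    }
}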

If you wonder how to tune the underlying HTTP client that is used in Spring Boot, check out this post.

Ethnomusicology answered 17/11, 2020 at 18:43 Comment(0)

I guess the issue is related to k8s.

  1. If you use flannel as the k8s network, check the flannel status and see whether it has restarted multiple times, using the command below:
kubectl get pod -n kube-system | grep flannel
  2. What version is your Linux kernel? If it is not 4.x or above, upgrade to 4.x.
# to check the linux kernel version
uname -sr

# upgrade step
1)
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-4.el7.elrepo.noarch.rpm
yum --enablerepo=elrepo-kernel -y install kernel-lt
2) open and edit /etc/default/grub, and set "GRUB_DEFAULT=0"
3) grub2-mkconfig -o /boot/grub2/grub.cfg
4) reboot

I hope this is useful for solving the issue.

Motorize answered 16/11, 2020 at 1:31 Comment(0)

An SSL stack trace like this can be caused by many different things that may have nothing to do with SSL itself. The stack trace alone will not help you much, and this issue likely has nothing to do with Spring, RestTemplate, etc.

What will help is implementing a logging/monitoring/tracing framework (I use Elasticsearch). Monitor the behavior for a couple of days and make sure you record as much information in these logs as needed, such as the container id and connection details (when the connection was initiated, etc.). You might find, for example, that the error occurs after a connection has lived for a certain amount of time (e.g. 1 hour), and that if you simply make connections live for less time, the issue goes away.

This way you may be able to fix the issue without needing to figure out the root cause, which could take many days of work and get you nowhere. Tinkering with the connection parameters may well resolve your issue, but for that you need more visibility; the info you've posted so far is not enough to troubleshoot it.
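
For example, a minimal sketch of capping connection lifetime, assuming Apache HttpClient 4.4+ where HttpClientBuilder exposes setConnectionTimeToLive; the 5-minute value and class name are illustrative:

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public final class ShortLivedConnectionConfig {

    public static RestTemplate restTemplate() {
        // Any pooled connection is closed and recreated after 5 minutes,
        // no matter how recently it was used.
        final CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionTimeToLive(5, TimeUnit.MINUTES)
                .build();
        return new RestTemplate(new HttpComponentsClientHttpRequestFactory(httpClient));
    }
}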

Nmr answered 17/11, 2020 at 1:50 Comment(0)

Try removing the DNS entry 8.8.8.8 (leave it empty).

On Windows, in an admin cmd prompt, try: netsh winsock reset

Dishonest answered 13/8, 2024 at 2:52 Comment(0)
