Podman pod disappears after a few days, but process is still running and listening on a given port

I am running an Elasticsearch container as a Podman pod, using podman play kube and a YAML definition of the pod. The pod is created, the three-node cluster comes up, and everything works as expected. But: the Podman pod dies after a few days of staying idle.

The podman pod ps command says:

ERRO[0000] Error refreshing container af05fafe31f6bfb00c2599255c47e35813ecf5af9bbe6760ae8a4abffd343627: error acquiring lock 1 for container af05fafe31f6bfb00c2599255c47e35813ecf5af9bbe6760ae8a4abffd343627: file exists
ERRO[0000] Error refreshing container b4620633d99f156bb59eb327a918220d67145f8198d1c42b90d81e6cc29cbd6b: error acquiring lock 2 for container b4620633d99f156bb59eb327a918220d67145f8198d1c42b90d81e6cc29cbd6b: file exists
ERRO[0000] Error refreshing pod 389b0c34313d9b23ecea3faa0e494e28413bd15566d66297efa9b5065e025262: error retrieving lock 0 for pod 389b0c34313d9b23ecea3faa0e494e28413bd15566d66297efa9b5065e025262: file exists
POD ID        NAME               STATUS   CREATED     INFRA ID      # OF CONTAINERS
389b0c34313d  elasticsearch-pod  Created  1 week ago  af05fafe31f6  2

What's weird is that if we look up which process is listening on port 9200 or 9300, something is still there:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::9200                 :::*                    LISTEN      1328607/containers-
tcp6       0      0 :::9300                 :::*                    LISTEN      1328607/containers-
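
For reference, output in this format comes from a netstat invocation along these lines (the exact command isn't shown in the question):

# Show listening TCP sockets together with the owning PID/program name (-p needs root)
sudo netstat -tlnp | grep -E ':9200|:9300'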

The process that is hanging (and keeping those ports open) is:

user+ 1339220  0.0  0.1  45452  8284 ?        S    Jan11   2:19 /bin/slirp4netns --disable-host-loopback --mtu 65520 --enable-sandbox --enable-seccomp -c -e 3 -r 4 --netns-type=path /tmp/run-1002/netns/cni-e4bb2146-d04e-c3f1-9207-380a234efa1f tap0
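
That entry comes from an ordinary process listing, e.g. (the bracketed character keeps grep itself out of the results):

ps aux | grep '[s]lirp4netns'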

The only actions I perform on the pod are the routine ones: podman pod stop, podman pod rm, and podman play kube to start the pod again.
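
Concretely, that maintenance cycle looks like this (pod name from the listing above; the YAML filename is an assumption):

podman pod stop elasticsearch-pod
podman pod rm elasticsearch-pod
podman play kube elasticsearch-pod.yaml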

What could be causing this strange behaviour in Podman? Why is the lock not released properly?

System information:

NAME="Red Hat Enterprise Linux"
VERSION="8.3 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.3"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.3"
Red Hat Enterprise Linux release 8.3 (Ootpa)

Podman version:

podman --version
podman version 2.2.1
Bought asked 22/2, 2021 at 22:15

The workaround that worked for me is to add the configuration file from the Podman repository [1] under /usr/lib/tmpfiles.d/ and /etc/tmpfiles.d/; this prevents systemd from removing Podman's temporary files from the /tmp directory [2]. Additionally, as stated in [3], CNI leaves network information behind in /var/lib/cni/networks when the system crashes or containers do not shut down properly. The problem occurs with rootless Podman and has been fixed in a later Podman release [4].

Workaround

First, check the default runRoot directory set for your rootless Podman user:

podman info | grep runRoot
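
On a rootless setup this prints something like the following (the path differs per user; this example value matches the one quoted in the comments below):

runRoot: /tmp/run-1002/containers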

Create the tmpfiles configuration file:

sudo vim /usr/lib/tmpfiles.d/podman.conf

Add the following content, replacing /tmp/podman-run-* with your default runRoot directory. For example, if your output is /tmp/run-6695/containers, then use x /tmp/run-* instead:

# /tmp/podman-run-* directories can contain content for Podman containers that have run
# for many days. The following lines prevent systemd from removing this content.
x /tmp/podman-run-*
x /tmp/containers-user-*
D! /run/podman 0700 root root
D! /var/lib/cni/networks
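
For context: in tmpfiles.d(5) syntax, an x line tells systemd-tmpfiles to ignore the matching path during cleanup, while a D! line (re)creates the given directory at boot.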

Copy the configuration file from /usr/lib/tmpfiles.d/ to /etc/tmpfiles.d/:

sudo cp -p /usr/lib/tmpfiles.d/podman.conf /etc/tmpfiles.d/
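
Optionally, apply the new rules right away instead of waiting for the next boot (--boot makes systemd-tmpfiles also execute the D! lines):

sudo systemd-tmpfiles --boot --create /etc/tmpfiles.d/podman.conf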

Once you have completed all the steps, adjusted to your configuration, the error should disappear.
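
As a quick sanity check (not part of the original workaround, so use with care on a busy host), you can trigger a manual cleanup pass and confirm the pod survives it:

# Run the same cleanup that systemd's timer performs periodically
sudo systemd-tmpfiles --clean
# As the rootless user: the pod should still be listed, without lock errors
podman pod ps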

References

  1. https://github.com/containers/podman/blob/master/contrib/tmpfile/podman.conf
  2. https://bugzilla.redhat.com/show_bug.cgi?id=1888988#c9
  3. https://github.com/containers/podman/commit/2e0a9c453b03d2a372a3ab03b9720237e93a067c
  4. https://github.com/containers/podman/pull/8241
Henning answered 23/2, 2021 at 12:04

Comments (3):
I've propagated the change to the servers; let's see how it helps. It's hard to test, since the issue only shows up after a few days. ;) Thank you! – Bought
For anyone taking this fix as a solution: make sure that runRoot corresponds to the directory you set to be ignored in tmpfiles.d. In my case I had to change /tmp/podman-run-* to /tmp/run-* (e.g. /tmp/run-1001). Consult podman info: runRoot: /tmp/run-1002/containers. Relevant Podman issue: github.com/containers/podman/issues/9663 – Bought
The workaround by @Henning does not seem to work for us. We still see the same error even after copying podman.conf to /etc/tmpfiles.d/: ERRO[0000] Error refreshing container 28d7f360049bd3c3bd7f55baf78af3e11e3baf9ad489586899c928767f51cb2d: error acquiring lock 0 for container 28d7f360049bd3c3bd7f55baf78af3e11e3baf9ad489586899c928767f51cb2d: file exists – Alina
