Bad gateway with traefik and docker swarm during service update

I’m trying to use traefik with docker swarm, but I’m having trouble during service updates. When I run a stack deploy or a service update, the service goes down for a few seconds.

How to reproduce:

1 - Create a Dockerfile:

FROM jwilder/whoami
RUN echo $(date) > daniel.txt

2 - Build 2 demo images:

$ docker build -t whoami:01 .
$ docker build -t whoami:02 .

3 - Create a docker-compose.yml:

version: '3.5'

services:
  app:
    image: whoami:01
    ports:
      - 81:8000
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        failure_action: rollback
      labels:
        - traefik.enable=true
        - traefik.backend=app
        - traefik.frontend.rule=Host:localhost
        - traefik.port=8000
        - traefik.docker.network=web
    networks:
      - web

  reverse-proxy:
    image: traefik
    command: 
      - "--api"
      - "--docker"
      - "--docker.swarmMode"
      - "--docker.domain=localhost"
      - "--docker.watch"
      - "--docker.exposedbydefault=false"
      - "--docker.network=web"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        failure_action: rollback
      placement:
        constraints:
          - node.role == manager
    networks:
      - web
    ports:
      - 80:80
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

networks:
  web:
    external: true
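
Since the web network is declared external, it has to exist before the stack is deployed. A minimal way to create it (as an overlay network, which the comments below confirm it is):

$ docker network create --driver overlay web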

4 - Deploy the stack:

$ docker stack deploy -c docker-compose.yml stack_name

5 - Curl to get the service response:

$ while true ; do sleep .1; curl localhost; done

You should see something like this:

I'm adc1473258e9
I'm bc82ea92b560
I'm adc1473258e9
I'm bc82ea92b560

That means load balancing is working.

6 - Update the service

$ docker service update --image whoami:02 stack_name_app
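
While the update rolls out, you can watch the tasks being replaced one at a time (the service name stack_name_app is assumed from the stack deployed in step 4):

$ docker service ps stack_name_app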

Traefik responds with Bad Gateway when there should be zero downtime.

How to fix it?

Sex answered 16/4, 2019 at 19:13 Comment(0)

Bad gateway means traefik is configured to forward requests, but it's not able to reach the container on the IP and port that it's configured to use. Common issues causing this are listed below, with a couple of quick checks after the list:

  • traefik and the service on different docker networks
  • service exists in multiple networks and traefik picks the wrong one
  • wrong port being used to connect to the container (use the container port and make sure it's listening on all interfaces, aka 0.0.0.0)
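
For example, you can see which containers are attached to the network traefik uses, and which traefik.* labels (network and port) the service actually carries (names taken from the question; docker network inspect only lists containers running on the node you run it from):

$ docker network inspect web --format '{{range .Containers}}{{.Name}} {{.IPv4Address}} {{end}}'
$ docker service inspect stack_name_app --format '{{json .Spec.Labels}}'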

From the comments, this is only happening during the deploy, which means traefik is hitting containers before they are ready to receive requests, or while they are being stopped.

You can configure containers with a healthcheck and send requests through swarm mode's VIP using a Dockerfile that looks like:

FROM jwilder/whoami
RUN echo $(date) >/build-date.txt
HEALTHCHECK --start-period=30s --retries=1 CMD wget -O - -q http://localhost:8000
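
Once the image is rebuilt with that HEALTHCHECK, swarm waits for a new task to report healthy before moving the rolling update along, and you can watch the state change from starting to healthy (the name filter assumes the stack name from the question):

$ docker ps --filter name=stack_name_app --format '{{.Names}}: {{.Status}}'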

And then in the docker-compose.yml:

  labels:
    - traefik.enable=true
    - traefik.backend=app
    - traefik.backend.loadbalancer.swarm=true
    ...

And I would also configure the traefik service with the following options:

  - "--retry.attempts=2"
  - "--forwardingTimeouts.dialTimeout=1s"

However, traefik will keep a connection open and the VIP will continue to send all requests to the same backend container over that same connection. What you can do instead is have traefik itself perform the healthcheck:

  labels:
    - traefik.enable=true
    - traefik.backend=app
    - traefik.backend.healthcheck.path=/
    ...

I would still leave the healthcheck on the container itself so Docker gives the container time to start before stopping the other container. And leave the retry option on the traefik service so any request to a stopping container, or to one that hasn't been detected by the healthcheck yet, has a chance to be retried.


Here's the resulting compose file that I used in my environment:

version: '3.5'

services:
  app:
    image: test-whoami:1
    ports:
      - 6081:8000
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        failure_action: rollback
      labels:
        - traefik.enable=true
        - traefik.backend=app
        - traefik.backend.healthcheck.path=/
        - traefik.frontend.rule=Path:/
        - traefik.port=8000
        - traefik.docker.network=test_web
    networks:
      - web

  reverse-proxy:
    image: traefik
    command:
      - "--api"
      - "--retry.attempts=2"
      - "--forwardingTimeouts.dialTimeout=1s"
      - "--docker"
      - "--docker.swarmMode"
      - "--docker.domain=localhost"
      - "--docker.watch"
      - "--docker.exposedbydefault=false"
      - "--docker.network=test_web"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        failure_action: rollback
      placement:
        constraints:
          - node.role == manager
    networks:
      - web
    ports:
      - 6080:80
      - 6880:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

networks:
  web:

The Dockerfile is as quoted above. Image names, ports, network names, etc. were changed to avoid conflicting with other things in my environment.
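
To verify there's no downtime with this setup, you can rerun the same kind of loop as in the question against the port published here (6080) while triggering an update in another terminal; the :2 tag and the test_app service name are assumptions based on the names used above:

$ while true ; do sleep .1; curl localhost:6080; done
$ docker service update --image test-whoami:2 test_app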

Cryptogam answered 17/4, 2019 at 14:58 Comment(6)
docker network ls -> bridge, docker_gwbridge, host, ingress, none, web. I do not have stack_name_web - Sex
@Sex whoops, just looked again and saw that you defined it as external. Is that an overlay network? Can you connect to port 81 on one host and reach your whoami container running on another host? - Cryptogam
Also, do the gateway errors go away by themselves and only appear during a deployment? - Cryptogam
Yes, web is an overlay network. I tested only on one node. The gateway errors appear only during the service update. - Sex
@Sex an error just during the deploy points to a missing or misconfigured healthcheck. Without a healthcheck, requests get sent to the new container before it's ready. - Cryptogam
I tried with a healthcheck and it still didn't work. Could you test or provide an example? - Sex

As of today (June 2021), Traefik can't drain connections during an update.

To achieve a zero-downtime rolling update you should delegate the load-balancing to docker swarm itself:

# traefik v2
# docker-compose.yml

services:
  your_service:
    deploy:
      labels:
        - traefik.docker.lbswarm=true

From the docs:

Enables Swarm's inbuilt load balancer (only relevant in Swarm Mode).

If you enable this option, Traefik will use the virtual IP provided by docker swarm instead of the containers IPs. Which means that Traefik will not perform any kind of load balancing and will delegate this task to swarm.

Further info:

https://github.com/traefik/traefik/issues/41

https://github.com/traefik/traefik/issues/1480

Vary answered 13/6, 2021 at 3:59 Comment(1)
I added this label to my service but it still gives me a bad gateway error after having restarted the whole Swarm stack. I posted at https://mcmap.net/q/2032718/-bad-gateway-as-traefik-fails-to-point-to-a-new-service-instance-after-a-failed-health-check/958373 - Dairy

The problem is that traefik is not listening for kill events and instead waits until a container is dead, which leaves too much time between kill and die during which the container is unresponsive but still registered in the traefik load balancer.

So far the only solution has been to use the docker swarm load balancer, as suggested by Daniel Silveira.
However, with lbswarm you won't be able to use traefik features like sticky sessions.

I wrote a patch that also listens for kill events, which in my tests proved to produce no more gateway errors / failed requests!

You can track my traefik issue.
I will update this post when the kill event gets implemented in the official traefik.

Note:
If you only have 1 replica, make sure you use order: start-first in update_config so docker swarm starts the new container first and waits until it is running before stopping the old one.
The default is stop-first!
Otherwise you will have gateway errors because of the downtime.
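
A minimal sketch of that setting in the compose file (the order key requires compose file format 3.4 or newer):

services:
  your_service:
    deploy:
      replicas: 1
      update_config:
        order: start-first   # start the new task before stopping the old one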

Intimidate answered 5/11 at 13:50 Comment(0)
