Is it possible to get a stack trace when a gunicorn worker hits a timeout?

Specifically, I'm running a Flask app with the default (sync) workers in gunicorn. I'm trying to figure out how to debug or trace what is happening when a worker is killed due to a timeout while serving a request. Is there a way to get a stack trace or profile the request to debug this?

Zoi answered 23/7, 2019 at 15:10 Comment(0)

Honestly, a very good question. I was not sure it was even possible, but as I started digging I found a lot of interesting threads:

Showing the stack trace from a running Python application

Get stacktrace from stuck python process

https://github.com/khamidou/lptrace

https://gist.github.com/reywood/e221c4061bbf2eccea885c9b2e4ef496

So first I created a simple Flask app with the code below.

app.py

from flask import Flask
app = Flask(__name__)
import time

def a():
    b()

def b():
    c()

def c():
    i = 0
    while i < 900:
        time.sleep(1)
        i += 1

@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def catch_all(path):
    a()
    return 'You want path: %s' % path

if __name__ == '__main__':
    app.run()

wsgi.py

from app import app

if __name__ == "__main__":
    app.run()

Now, running the app as below and hitting it with curl localhost:8000/abc:

$ gunicorn wsgi:app
[2019-08-01 08:19:06 +0000] [26825] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:19:06 +0000] [26825] [INFO] Listening at: http://127.0.0.1:8000 (26825)
[2019-08-01 08:19:06 +0000] [26825] [INFO] Using worker: sync
[2019-08-01 08:19:06 +0000] [26828] [INFO] Booting worker with pid: 26828
[2019-08-01 08:19:40 +0000] [26825] [CRITICAL] WORKER TIMEOUT (pid:26828)
[2019-08-01 08:19:40 +0000] [26828] [INFO] Worker exiting (pid: 26828)
[2019-08-01 08:19:40 +0000] [26832] [INFO] Booting worker with pid: 26832

Now what we need is a hook that is called before the worker is killed. gunicorn supports server hooks in the configuration file; worker_abort is called when a worker receives the SIGABRT signal, which is what happens on a timeout.

So now we create a config file

gunicorn_config.py

timeout = 3

def worker_abort(worker):
    pid = worker.pid
    print("worker is being killed - {}".format(pid))

And our output is now

$ gunicorn -c gunicorn_config.py wsgi:app
[2019-08-01 08:22:17 +0000] [26837] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:22:17 +0000] [26837] [INFO] Listening at: http://127.0.0.1:8000 (26837)
[2019-08-01 08:22:17 +0000] [26837] [INFO] Using worker: sync
[2019-08-01 08:22:17 +0000] [26840] [INFO] Booting worker with pid: 26840
[2019-08-01 08:22:22 +0000] [26837] [CRITICAL] WORKER TIMEOUT (pid:26840)
worker is being killed - 26840
[2019-08-01 08:22:22 +0000] [26840] [INFO] Worker exiting (pid: 26840)
[2019-08-01 08:22:22 +0000] [26844] [INFO] Booting worker with pid: 26844

This is good. Now we need to combine this hook with the pyrasite approach from the threads above to get the stack, so we update the config file as below.

gunicorn_config.py

timeout = 3

__code_dump_stack__ = """
import sys, traceback

for thread, frame in sys._current_frames().items():
    print('Thread 0x%x' % thread)
    traceback.print_stack(frame)
    print()
"""

def dump_stack_for_process(pid):
    import pyrasite

    ipc = pyrasite.PyrasiteIPC(pid)
    ipc.connect()
    print(ipc.cmd(__code_dump_stack__))
    ipc.close()

def worker_abort(worker):
    pid = worker.pid
    print("worker is being killed - {}".format(pid))
    dump_stack_for_process(pid)

And now our output is

$ gunicorn -c gunicorn_config.py wsgi:app
[2019-08-01 08:25:29 +0000] [26848] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:25:29 +0000] [26848] [INFO] Listening at: http://127.0.0.1:8000 (26848)
[2019-08-01 08:25:29 +0000] [26848] [INFO] Using worker: sync
[2019-08-01 08:25:29 +0000] [26851] [INFO] Booting worker with pid: 26851
[2019-08-01 08:25:38 +0000] [26848] [CRITICAL] WORKER TIMEOUT (pid:26851)
worker is being killed - 26851
Thread 0x7ff0a7a4b700
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "<string>", line 72, in run
  File "<string>", line 92, in on_command
  File "<string>", line 6, in <module>

Thread 0x7ff0ac512700
  File "/home/vagrant/.local/bin/gunicorn", line 11, in <module>
    sys.exit(run())
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 61, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 223, in run
    super(Application, self).run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 203, in run
    self.manage_workers()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
    self.spawn_workers()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
    self.spawn_worker()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 134, in init_process
    self.run()
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 124, in run
    self.run_for_one(timeout)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 30, in accept
    self.handle(listener, client, addr)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 176, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/vagrant/remotedebug/app.py", line 20, in catch_all
    a()
  File "/home/vagrant/remotedebug/app.py", line 6, in a
    b()
  File "/home/vagrant/remotedebug/app.py", line 9, in b
    c ()
  File "/home/vagrant/remotedebug/app.py", line 14, in c
    time.sleep(1)
  File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 195, in handle_abort
    self.cfg.worker_abort(self)
  File "gunicorn_config.py", line 23, in worker_abort
    dump_stack_for_process(pid)
  File "gunicorn_config.py", line 17, in dump_stack_for_process
    print(ipc.cmd(__code_dump_stack__))
  File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 161, in cmd
    return self.recv()
  File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 174, in recv
    header_data = self.recv_bytes(4)
  File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 187, in recv_bytes
    chunk = self.sock.recv(n - len(data))


[2019-08-01 08:25:38 +0000] [26851] [INFO] Worker exiting (pid: 26851)
[2019-08-01 08:25:38 +0000] [26862] [INFO] Booting worker with pid: 26862

The stack trace is large, but it gives us what we need: the worker was stuck in time.sleep(1) inside c(), reached from catch_all() via a() and b().

Karlene answered 1/8, 2019 at 8:26 Comment(3)
Is pyrasite needed here? From some experimentation, worker_abort seems to run in the worker process. – Saccharify
Were you able to get the stack trace from that without pyrasite? – Karlene
For what it's worth, I just tried with exec(__code_dump_stack__) instead of the few pyrasite lines and it prints a correct stack trace. The docs also mention that worker_abort runs in the worker process, so it makes sense. – Lutero
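Building on these comments, here is a minimal pyrasite-free sketch (my own variation, not from the answer above): since worker_abort runs inside the timed-out worker itself, the same stack-dumping snippet can simply be executed in place with sys._current_frames(), which also covers any background threads.

gunicorn_config.py

import sys
import traceback

timeout = 3

def worker_abort(worker):
    # worker_abort runs in the worker process, so we can walk its
    # own frames directly; no external injection is needed.
    print("worker is being killed - {}".format(worker.pid))
    for thread_id, frame in sys._current_frames().items():
        print('Thread 0x%x' % thread_id)
        traceback.print_stack(frame)
        print()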

Just wanted to add to the accepted answer that there is also another way to get similar results, using the built-in faulthandler module introduced in Python 3.3: https://docs.python.org/dev/library/faulthandler.html

You can add the following two lines to wsgi.py mentioned in the accepted answer:

import faulthandler
faulthandler.enable()

When a timeout occurs, the following output is printed to stderr:

[2020-12-24 13:38:32 +0000] [31304] [INFO] Starting gunicorn 20.0.4
[2020-12-24 13:38:32 +0000] [31304] [INFO] Listening at: http://0.0.0.0:8888 (31304)
[2020-12-24 13:38:32 +0000] [31304] [INFO] Using worker: sync
[2020-12-24 13:38:32 +0000] [31307] [INFO] Booting worker with pid: 31307
[2020-12-24 13:38:55 +0000] [31304] [CRITICAL] WORKER TIMEOUT (pid:31307)
Fatal Python error: Aborted

Current thread 0x00007f411d781700 (most recent call first):
  File "/tmp/app.py", line 14 in c
  File "/tmp/app.py", line 9 in b
  File "/tmp/app.py", line 6 in a
  File "/tmp/app.py", line 20 in catch_all
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 1936 in dispatch_request
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 1950 in full_dispatch_request
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 2447 in wsgi_app
  File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 2464 in __call__
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 175 in handle_request
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 134 in handle
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 29 in accept
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 67 in run_for_one
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 123 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/base.py", line 140 in init_process
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 583 in spawn_worker
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 616 in spawn_workers
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 545 in manage_workers
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 202 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 72 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 228 in run
  File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 58 in run
  File "/tmp/venv/bin/gunicorn", line 8 in <module>
[2020-12-24 13:38:55 +0000] [31307] [INFO] Worker exiting (pid: 31307)
[2020-12-24 13:38:55 +0000] [31503] [INFO] Booting worker with pid: 31503
Lewellen answered 24/12, 2020 at 12:45 Comment(1)
I couldn't get this to work. As I understand it, faulthandler will kick in and print a traceback when the Python interpreter receives certain signals. But when gunicorn kills its workers it appears to use SIGKILL, which can't be caught, so faulthandler has no chance to be invoked from a signal handler. Maybe this is a recent (since you posted your answer) change in behaviour in gunicorn; I'm testing with version 20.1.0. – Betweenwhiles
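If gunicorn's kill signal never reaches the worker, a signal-independent workaround (my own sketch, not from the answer above; faulthandler.register is POSIX-only) is to register faulthandler on a signal you control and send it by hand while the request is hanging:

wsgi.py

import faulthandler
import signal

from app import app

faulthandler.enable()
# Dump all thread stacks on demand; trigger with: kill -USR1 <worker_pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)

if __name__ == "__main__":
    app.run()

Because you send SIGUSR1 yourself before the timeout fires, this does not depend on which signal gunicorn uses to terminate the worker.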

Also adding to the accepted answer: I find that import traceback; traceback.print_stack() works for me, since worker_abort runs in the context of the worker process. So you just need to change gunicorn_config.py to something like this:

import traceback

timeout = 3

def worker_abort(worker):
    pid = worker.pid
    print("worker is being killed - {}".format(pid))
    traceback.print_stack()
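One caveat worth noting (my addition, not from the answer): traceback.print_stack() only shows the thread that calls it. If you would rather have the dump in gunicorn's error log than on stdout, a small variation along these lines should also work, assuming worker.log exposes the usual logging methods:

gunicorn_config.py

import traceback

timeout = 3

def worker_abort(worker):
    # Send the stack of the timed-out request to gunicorn's error log.
    stack = ''.join(traceback.format_stack())
    worker.log.error("worker %s timed out:\n%s", worker.pid, stack)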
Isabellisabella answered 8/4, 2023 at 0:14 Comment(0)
