Specifically, I'm running a Flask app with default workers in gunicorn. I'm trying to figure out how to debug / trace what is happening when a worker is killed due to timeout while serving a request. Is there a way to get a stack trace or profile the request to debug this?
Honestly a very good question, I was not sure if it is even possible to achieve the same or not. But as I started digging I found lot of interesting threads
Showing the stack trace from a running Python application
Get stacktrace from stuck python process
https://github.com/khamidou/lptrace
https://gist.github.com/reywood/e221c4061bbf2eccea885c9b2e4ef496
So first I created a simple flask app with below code
app.py
from flask import Flask
app = Flask(__name__)
import time
def a():
b()
def b():
c ()
def c():
i = 0
while i < 900:
time.sleep(1)
i += 1
@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def catch_all(path):
a()
return 'You want path: %s' % path
if __name__ == '__main__':
app.run()
wsgi.py
from app import app
if __name__ == "__main__":
app.run()
Now running the app like below and doing curl localhost:8000/abc
$ gunicorn wsgi:app
[2019-08-01 08:19:06 +0000] [26825] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:19:06 +0000] [26825] [INFO] Listening at: http://127.0.0.1:8000 (26825)
[2019-08-01 08:19:06 +0000] [26825] [INFO] Using worker: sync
[2019-08-01 08:19:06 +0000] [26828] [INFO] Booting worker with pid: 26828
[2019-08-01 08:19:40 +0000] [26825] [CRITICAL] WORKER TIMEOUT (pid:26828)
[2019-08-01 08:19:40 +0000] [26828] [INFO] Worker exiting (pid: 26828)
[2019-08-01 08:19:40 +0000] [26832] [INFO] Booting worker with pid: 26832
Now what we need is a hook which can be called before the worked is killed. gunicorn
supports server events in the configuration file
So now we create a config file
gunicorn_config.py
timeout = 3
def worker_abort(worker):
pid = worker.pid
print("worker is being killed - {}".format(pid))
And our output is now
$ gunicorn -c gunicorn_config.py wsgi:app
[2019-08-01 08:22:17 +0000] [26837] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:22:17 +0000] [26837] [INFO] Listening at: http://127.0.0.1:8000 (26837)
[2019-08-01 08:22:17 +0000] [26837] [INFO] Using worker: sync
[2019-08-01 08:22:17 +0000] [26840] [INFO] Booting worker with pid: 26840
[2019-08-01 08:22:22 +0000] [26837] [CRITICAL] WORKER TIMEOUT (pid:26840)
worker is being killed - 26840
[2019-08-01 08:22:22 +0000] [26840] [INFO] Worker exiting (pid: 26840)
[2019-08-01 08:22:22 +0000] [26844] [INFO] Booting worker with pid: 26844
This is good, now we need mix our previous knowledge on pyrasite
and this to get the stack. So we updated the config file like below
gunicorn_config.py
timeout = 3
__code_dump_stack__ = """
import sys, traceback
for thread, frame in sys._current_frames().items():
print('Thread 0x%x' % thread)
traceback.print_stack(frame)
print()
"""
def dump_stack_for_process(pid):
import pyrasite
ipc = pyrasite.PyrasiteIPC(pid)
ipc.connect()
print(ipc.cmd(__code_dump_stack__))
ipc.close()
def worker_abort(worker):
pid = worker.pid
print("worker is being killed - {}".format(pid))
dump_stack_for_process(pid)
And now our output is
$ [2019-08-01 08:25:29 +0000] [26848] [INFO] Starting gunicorn 19.9.0
[2019-08-01 08:25:29 +0000] [26848] [INFO] Listening at: http://127.0.0.1:8000 (26848)
[2019-08-01 08:25:29 +0000] [26848] [INFO] Using worker: sync
[2019-08-01 08:25:29 +0000] [26851] [INFO] Booting worker with pid: 26851
[2019-08-01 08:25:38 +0000] [26848] [CRITICAL] WORKER TIMEOUT (pid:26851)
worker is being killed - 26851
Thread 0x7ff0a7a4b700
File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "<string>", line 72, in run
File "<string>", line 92, in on_command
File "<string>", line 6, in <module>
Thread 0x7ff0ac512700
File "/home/vagrant/.local/bin/gunicorn", line 11, in <module>
sys.exit(run())
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 61, in run
WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 223, in run
super(Application, self).run()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 72, in run
Arbiter(self).run()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 203, in run
self.manage_workers()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
self.spawn_workers()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
self.spawn_worker()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
worker.init_process()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.run()
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 124, in run
self.run_for_one(timeout)
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 68, in run_for_one
self.accept(listener)
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 30, in accept
self.handle(listener, client, addr)
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 176, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2463, in __call__
return self.wsgi_app(environ, start_response)
File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/home/vagrant/remotedebug/app.py", line 20, in catch_all
a()
File "/home/vagrant/remotedebug/app.py", line 6, in a
b()
File "/home/vagrant/remotedebug/app.py", line 9, in b
c ()
File "/home/vagrant/remotedebug/app.py", line 14, in c
time.sleep(1)
File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 195, in handle_abort
self.cfg.worker_abort(self)
File "gunicorn_config.py", line 23, in worker_abort
dump_stack_for_process(pid)
File "gunicorn_config.py", line 17, in dump_stack_for_process
print(ipc.cmd(__code_dump_stack__))
File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 161, in cmd
return self.recv()
File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 174, in recv
header_data = self.recv_bytes(4)
File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 187, in recv_bytes
chunk = self.sock.recv(n - len(data))
[2019-08-01 08:25:38 +0000] [26851] [INFO] Worker exiting (pid: 26851)
[2019-08-01 08:25:38 +0000] [26862] [INFO] Booting worker with pid: 26862
The stack trace is large but it gives us what we need
pyrasite
? –
Karlene exec(__code_dump_stack__)
instead of the few pyrasite lines and it prints a correct stacktrace. The docs also mention that worker_abort
runs in the worker process so it makes sense. –
Lutero Just wanted to add to the accepted answer that there's, also, another way to get similar results by using the built-in Python 3 module faulthandler
introduced in v3.3 https://docs.python.org/dev/library/faulthandler.html
You can add the following two lines to wsgi.py mentioned in the accepted answer:
import faulthandler
faulthandler.enable()
When timeout occurs the following output will be printed to stderr:
[2020-12-24 13:38:32 +0000] [31304] [INFO] Starting gunicorn 20.0.4
[2020-12-24 13:38:32 +0000] [31304] [INFO] Listening at: http://0.0.0.0:8888 (31304)
[2020-12-24 13:38:32 +0000] [31304] [INFO] Using worker: sync
[2020-12-24 13:38:32 +0000] [31307] [INFO] Booting worker with pid: 31307
[2020-12-24 13:38:55 +0000] [31304] [CRITICAL] WORKER TIMEOUT (pid:31307)
Fatal Python error: Aborted
Current thread 0x00007f411d781700 (most recent call first):
File "/tmp/app.py", line 14 in c
File "/tmp/app.py", line 9 in b
File "/tmp/app.py", line 6 in a
File "/tmp/app.py", line 20 in catch_all
File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 1936 in dispatch_request
File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 1950 in full_dispatch_request
File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 2447 in wsgi_app
File "/tmp/venv/lib/python3.5/site-packages/flask/app.py", line 2464 in __call__
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 175 in handle_request
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 134 in handle
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 29 in accept
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 67 in run_for_one
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 123 in run
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/workers/base.py", line 140 in init_process
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 583 in spawn_worker
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 616 in spawn_workers
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 545 in manage_workers
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 202 in run
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 72 in run
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 228 in run
File "/tmp/venv/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 58 in run
File "/tmp/venv/bin/gunicorn", line 8 in <module>
[2020-12-24 13:38:55 +0000] [31307] [INFO] Worker exiting (pid: 31307)
[2020-12-24 13:38:55 +0000] [31503] [INFO] Booting worker with pid: 31503
Also add to the accepted answer, I find import traceback; traceback.print_stack()
works for me since worker_abort
runs int he context of worker process. So, just need to change the gunicorn_config.py
to something like this
import traceback
timeout = 3
def worker_abort(worker):
pid = worker.pid
print("worker is being killed - {}".format(pid))
traceback.print_stack()
© 2022 - 2024 — McMap. All rights reserved.