Problem description
I am trying to create a Flask app that should:
- Be reachable on localhost only, so there is no network slowdown
- Take a fairly large amount of data as input (a 30 MB numpy array) and return a relatively small amount (around 1 MB).
I made a quick test, ran it with the Flask development server, and it worked as expected. Scared off by the red warning "WARNING: This is a development server. Do not use it in a production deployment.", I tried putting the app behind a WSGI server, but both Waitress and GUnicorn turned out to be much slower. Tests (on a toy problem with artificial input, a tiny output, and fully reproducible code) are below.
Code to run the tests
I've put these three files in a folder:
basic_flask_app.py (this is deliberately made to do very little with the data it receives; the real code is a deep learning model that runs quite fast on GPU, but this example makes the issue more extreme)
    import numpy as np
    from flask import Flask, request

    from do_request import IS_SMALL_DATA, WIDTH, HEIGHT

    app = Flask(__name__)

    @app.route('/predict', methods=['POST'])
    def predict():
        numpy_bytes = np.frombuffer(request.data, np.float32)
        if IS_SMALL_DATA:
            numpy_image = np.zeros((HEIGHT, WIDTH)) + numpy_bytes
        else:
            numpy_image = numpy_bytes.reshape(HEIGHT, WIDTH)
        result = numpy_image.mean(axis=1).std(axis=0)
        return result.tobytes()

    if __name__ == '__main__':
        app.run(host='localhost', port=80, threaded=False, processes=1)
[Edited: the original version of this question was missing the parameters threaded=False, processes=1 in the call to app.run above, so the behaviour was not the same as for GUnicorn and Waitress below, which are forced to a single thread/process. I've added them now and re-tested; the results don't change: the Flask server is still fast after this change, if anything faster.]
do_request.py
    import requests
    import numpy as np
    from tqdm import trange

    WIDTH = 2500
    HEIGHT = 3000
    IS_SMALL_DATA = False

    def main(url='http://127.0.0.1:80/predict'):
        n = WIDTH * HEIGHT
        if IS_SMALL_DATA:
            np_image = np.zeros(1, dtype=np.float32)
        else:
            np_image = np.arange(n).astype(np.float32) / np.float32(n)
        results = []
        for _ in trange(50):
            results.append(requests.post(url, data=np_image.tobytes()))

    if __name__ == '__main__':
        main()
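As a side note, the collected responses are raw bytes; to sanity-check a run, the last one can be decoded inside main() with np.frombuffer. A minimal sketch, assuming the server took the large-data branch (where the float32 input keeps the result in float32; the zeros() branch would promote it to float64):

    # Decode the scalar returned by /predict (dtype assumption: float32).
    value = np.frombuffer(results[-1].content, dtype=np.float32)[0]
    print(value)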
waitress_server.py
    from waitress import serve

    import basic_flask_app

    serve(basic_flask_app.app, host='127.0.0.1', port=80, threads=1)
Test results
I've run the tests by running python do_request.py
after starting the server with one of the following three commands:
    python basic_flask_app.py
    python waitress_server.py
    gunicorn -w 1 basic_flask_app:app -b 127.0.0.1:80
With each of these three options, toggling the IS_SMALL_DATA
flag (if True, only 4 bytes of data are transmitted; if False, 30 MB), I got the following timings:
| 50 requests | Flask | Waitress | GUnicorn |
| --- | --- | --- | --- |
| 30 MB input, 4 B output | 00:01 (28.6 it/s) | 00:11 (4.42 it/s) | 00:11 (4.26 it/s) |
| 4 B input, 4 B output | 00:01 (25.2 it/s) | 00:02 (23.6 it/s) | 00:01 (26.4 it/s) |
As you can see, the Flask development server is fast regardless of how much data is transmitted (the "small" case is even marginally slower, probably because the server allocates a full HEIGHT × WIDTH array on each of the 50 iterations), while both Waitress and GUnicorn take a significant speed hit as the amount of transmitted data grows.
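My guess at what is going on (unconfirmed): the WSGI servers read the request body from the socket in fixed-size chunks, and pushing 30 MB through a small receive buffer costs far more than reading it in a few large chunks; the truncated comment at the end of this post points in the same direction. The following self-contained sketch isolates just that effect with raw sockets, no WSGI server involved; the two buffer sizes are illustrative and are not claimed to be the values Waitress or GUnicorn actually use:

    import socket
    import threading
    import time

    PAYLOAD = b'\x00' * (30 * 1024 * 1024)  # roughly the 30 MB input size

    def read_all(conn, total, bufsize):
        # Receive exactly `total` bytes in chunks of at most `bufsize`.
        remaining = total
        while remaining > 0:
            chunk = conn.recv(min(bufsize, remaining))
            if not chunk:
                break
            remaining -= len(chunk)

    def time_transfer(bufsize):
        server = socket.socket()
        server.bind(('127.0.0.1', 0))  # let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]

        def serve_once():
            conn, _ = server.accept()
            read_all(conn, len(PAYLOAD), bufsize)
            conn.close()

        thread = threading.Thread(target=serve_once)
        thread.start()
        client = socket.create_connection(('127.0.0.1', port))
        start = time.perf_counter()
        client.sendall(PAYLOAD)
        client.close()
        thread.join()  # wait until the server has consumed everything
        server.close()
        return time.perf_counter() - start

    for bufsize in (1024, 1024 * 1024):  # sizes from the comment at the end
        print(f'bufsize={bufsize}: {time_transfer(bufsize):.3f}s')

The absolute numbers will not match the table above, since the real servers also parse headers and buffer the body, but it shows how much the receive chunk size alone can matter.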
Questions
At this point, I have a couple of questions:
- Do Waitress and GUnicorn run some kind of check on the submitted data that takes time? If so, is there a way to disable it?
- Is there an important reason why Waitress / GUnicorn are better than the Flask development server, or could I just use the development server for my use case (see the sketch after this list)? As mentioned:
  - I don't care about security; the endpoints are reachable only from localhost, and the data that goes into them is generated by another process of mine.
  - I actively want exactly one process/thread running at a time, which is the only possibility for the Flask development server and which I enforced for the others. This is because my real app will run on a GPU, and with many processes/threads I would quickly run out of memory.
  - I know that at any point in time there will only be a small number of connections to this server (probably 4, certainly no more than 8), so scaling is not an issue either.
  - ... but this will be in production, so I need something reliable and stable.
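For what it's worth, regarding just using the development server: Flask's app.run() is a thin wrapper around Werkzeug's run_simple, so the same server can be started explicitly. A minimal sketch of that (it does not by itself settle whether Werkzeug is reliable enough for production; it only makes explicit what app.run does):

    from werkzeug.serving import run_simple

    import basic_flask_app

    # The server app.run() starts under the hood: bound to localhost,
    # single thread, single process, no reloader and no debugger.
    run_simple('localhost', 80, basic_flask_app.app, threaded=False, processes=1)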
A truncated comment from Vibraculum: "…1024 to e.g. 1024**2 = 1048576, it then takes just 0.02 s (still a bit slower than Flask, but an acceptable 20 it/s). However, this is not a solution, as it dramatically slows things down when very little data is used; I guess some next step must then parse this buffer, and it's not optimized for receiving 4 bytes + 1 MB of nothing."
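If the buffer patched in that comment is Waitress's socket read size, it may be tunable without editing the source: as far as I remember, waitress.serve() accepts a recv_bytes adjustment. Treat the parameter name and its effect as an assumption and check the adjustments documentation of your Waitress version; a sketch:

    from waitress import serve

    import basic_flask_app

    # Assumption: recv_bytes sets the chunk size Waitress passes to socket.recv();
    # verify that this adjustment exists and behaves this way in your version.
    serve(basic_flask_app.app, host='127.0.0.1', port=80, threads=1,
          recv_bytes=1024 * 1024)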