Serve a trained TensorFlow model with a REST API using Flask?

I have a trained TensorFlow model and I want to serve its prediction method through a REST API. What I can think of is using Flask to build a simple REST API that receives JSON as input, calls the predict method in TensorFlow, and then returns the predicted result to the client side.
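
Roughly what I have in mind is something like this minimal sketch (TensorFlow 1.x style; the checkpoint path and tensor names are just placeholders for illustration):

    import tensorflow as tf
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Restore the model once at startup, not per request.
    sess = tf.Session()
    saver = tf.train.import_meta_graph("./model/model.ckpt.meta")
    saver.restore(sess, "./model/model.ckpt")
    graph = tf.get_default_graph()
    # The tensor names below are assumptions about how the graph was built.
    inputs = graph.get_tensor_by_name("inputs:0")
    predictions = graph.get_tensor_by_name("predictions:0")

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()  # e.g. {"inputs": [[...], [...]]}
        result = sess.run(predictions, feed_dict={inputs: payload["inputs"]})
        return jsonify({"predictions": result.tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)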

I would like to know whether there are any concerns with doing it this way, especially in a production environment.

Many thanks!

Goodard answered 8/4, 2016 at 6:50 Comment(1)
Did you have any success? I look forward to hearing about it. – Autoclave

The first concern that comes to mind is performance.

The TensorFlow team seems to have worked out server/client usage. You may want to look into TensorFlow Serving. By default, it uses gRPC as the communication protocol.
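
For illustration, a prediction request to TensorFlow Serving over gRPC looks roughly like the sketch below, built on the tensorflow-serving-api package; the host/port, model name, signature name, and tensor keys are assumptions that depend on how the server is started and how the model was exported:

    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    # Connect to a model server assumed to be listening on localhost:8500.
    channel = grpc.insecure_channel("localhost:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Build the request; model, signature, and tensor names are placeholders.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "my_model"
    request.model_spec.signature_name = "serving_default"
    request.inputs["inputs"].CopyFrom(
        tf.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32))

    response = stub.Predict(request, 10.0)  # 10 second timeout
    print(response.outputs["outputs"].float_val)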

Quacksalver answered 8/4, 2016 at 23:35 Comment(2)
I agree, but do you know how much we would suffer? Once we restore the models and reuse them in the Flask server, perhaps it won't hurt too much. – Autoclave
@SungKim Do you mean you actually prefer using Flask? – Goodard

We use Flask + TensorFlow Serving at work. Our setup might not be the optimal way to serve models, but it gets the job done and has worked fine for us so far.

The setup is the following:

  1. Because tfserving takes forever to build, we built a Docker image (no GPU support or anything, but it works for just serving a model, and it's faster and better than serving it directly from within a huge Python/Flask monolith). The model server image can be found here: https://hub.docker.com/r/epigramai/model-server/
  2. Then Flask is used to set up an API. In order to send requests to the model server we need a gRPC prediction client, so we built one in Python that we can import directly into the Flask API: https://github.com/epigramai/tfserving_predict_client/.

The good thing here is that the model is not served by the Flask API application itself. The Docker model server can easily be replaced with a model server running on a GPU, compiled for the machine's hardware, instead of the Docker container.
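
To give an idea of the shape of the Flask layer, here is a rough self-contained sketch (this is not our actual predict client, and the server address, model name, and tensor keys are placeholders). The point is that no model is loaded in the Flask process; requests are simply forwarded to the model server over gRPC:

    import grpc
    import tensorflow as tf
    from flask import Flask, request, jsonify
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    app = Flask(__name__)

    # One channel/stub per process, reused across requests; the address of the
    # model server container is an assumption.
    channel = grpc.insecure_channel("model-server:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()  # e.g. {"inputs": [[...], [...]]}
        req = predict_pb2.PredictRequest()
        req.model_spec.name = "my_model"
        req.model_spec.signature_name = "serving_default"
        req.inputs["inputs"].CopyFrom(
            tf.make_tensor_proto(payload["inputs"], dtype=tf.float32))
        resp = stub.Predict(req, 10.0)  # 10 second timeout
        return jsonify({"predictions": list(resp.outputs["outputs"].float_val)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)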

Luminescent answered 3/8, 2017 at 22:6 Comment(2)
Can you say anything about the inference times you're seeing with this setup (and also what overhead the Flask API adds)? – Poriferous
For our use case it works fine, but we're not getting tons of requests, so batching API requests before inference is not a must-have for us. I would say that this setup, with the overhead of sending requests to the model server and so on, is about as fast as just loading the models in memory with TensorFlow and Flask in the same monolith. We find it useful because we can remove TensorFlow complexity from the Python Flask app. We haven't done a lot of testing or compared inference times; the key advantage to us is the separation of concerns. – Luminescent

I think that one of your main concerns might be batching the requests. For example, let's say your model is a trained CNN such as VGG, Inception, or similar. If you implement a regular web service with Flask, then for each prediction request you receive (assuming you're running on a GPU) you will run the prediction for a single image on the GPU, which can be suboptimal, since you could batch similar requests instead.

That's one of the things TensorFlow Serving aims to offer: the ability to combine requests for the same model/signature into a single batch before sending them to the GPU, making more efficient use of resources and (potentially) improving throughput. You can find more information here: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/batching
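
Server-side batching is enabled when starting the model server (via the --enable_batching flag together with a batching parameters file, as described in the linked docs). If I recall the text-proto format correctly, the parameters file looks roughly like this, with values you would tune for your model and hardware:

    max_batch_size { value: 32 }
    batch_timeout_micros { value: 5000 }
    max_enqueued_batches { value: 100 }
    num_batch_threads { value: 4 }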

That said, it depends very much on the scenario. But batching of predictions is something important to keep in mind.

Crowned answered 22/8, 2018 at 3:9 Comment(0)
