Tensorflow Serving: When to use it rather than simple inference inside Flask service?

I am serving a model trained using the Object Detection API. Here is how I did it:

  • Create a TensorFlow Serving server on port 9000, as described in the basic tutorial

  • Create Python code that calls this service using predict_pb2 from tensorflow_serving.apis, similar to this (a rough sketch follows this list)

  • Call this code inside a Flask server to make the service available over HTTP
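Roughly, the client plus Flask wrapper looks like this (the model name, signature name, and input key below are illustrative placeholders, not my exact code):

    # Rough sketch of the gRPC client wrapped in a Flask view; the model name,
    # signature name, and input tensor key are illustrative assumptions.
    import grpc
    import numpy as np
    import tensorflow as tf
    from flask import Flask, request, jsonify
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    app = Flask(__name__)

    channel = grpc.insecure_channel("localhost:9000")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    @app.route("/detect", methods=["POST"])
    def detect():
        # Decode the incoming image however your client sends it (raw bytes here).
        image = np.frombuffer(request.data, dtype=np.uint8)

        grpc_request = predict_pb2.PredictRequest()
        grpc_request.model_spec.name = "object_detector"          # assumed model name
        grpc_request.model_spec.signature_name = "serving_default"
        grpc_request.inputs["inputs"].CopyFrom(                   # assumed input key
            tf.make_tensor_proto(image, shape=[1] + list(image.shape)))

        result = stub.Predict(grpc_request, 10.0)  # 10-second timeout
        outputs = {k: tf.make_ndarray(v).tolist() for k, v in result.outputs.items()}
        return jsonify(outputs)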

Still, I could have done things in a much simpler way:
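Something along these lines, loading the exported model directly in the Flask process (the path, tensor names, and input shape below are illustrative, not my exact code):

    # Hypothetical sketch of the simpler alternative: load the exported model once
    # and run inference in-process inside the Flask server. Path, tensor names,
    # and the 300x300 input size are just examples.
    import numpy as np
    import tensorflow as tf
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Load the SavedModel directly into this process (TF1-style; path is assumed).
    sess = tf.Session(graph=tf.Graph())
    tf.saved_model.loader.load(sess, ["serve"], "/models/object_detector/1")
    graph = sess.graph

    @app.route("/detect", methods=["POST"])
    def detect():
        # Decode the request body into a [1, H, W, 3] uint8 array as appropriate.
        image = np.frombuffer(request.data, dtype=np.uint8).reshape(1, 300, 300, 3)
        boxes, scores = sess.run(
            [graph.get_tensor_by_name("detection_boxes:0"),
             graph.get_tensor_by_name("detection_scores:0")],
            feed_dict={graph.get_tensor_by_name("image_tensor:0"): image})
        return jsonify({"boxes": boxes.tolist(), "scores": scores.tolist()})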

As you can see, I could have skipped the use of TensorFlow Serving.

So, is there any good reason to use TensorFlow Serving in my case? If not, in what cases should I use it?

Boak answered 30/1, 2018 at 17:28 Comment(0)

I believe most of the reasons why you would prefer TensorFlow Serving over Flask are related to performance:

  • TensorFlow Serving makes use of gRPC and Protobuf, while a regular Flask web service uses REST and JSON. A REST/JSON service typically runs over HTTP/1.1, while gRPC uses HTTP/2 (there are important differences). In addition, Protobuf is a binary serialization format and is more efficient than JSON.
  • TensorFlow Serving can batch requests to the same model, which makes better use of the hardware (e.g. GPUs).
  • TensorFlow Serving can manage model versioning.

As with almost everything, it depends a lot on your use case and scenario, so it's important to think about the pros and cons and your requirements. TensorFlow Serving has great features, but these features could also be implemented to work with Flask with some effort (for instance, you could create your own batching mechanism).
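For instance, here is a very rough sketch of a hand-rolled batching mechanism you could put behind Flask (predict_batch(), the batch size, and the timeout are placeholders):

    # A very rough sketch of hand-rolled request batching behind Flask;
    # predict_batch(), the batch size, and the timeout are placeholders.
    import queue
    import threading

    request_queue = queue.Queue()

    def predict_batch(inputs):
        # Placeholder: run the model once over the whole batch (e.g. a single
        # sess.run over a stacked input tensor) and return one output per input.
        return [None for _ in inputs]

    def batching_worker(max_batch=32, wait_seconds=0.01):
        while True:
            batch = [request_queue.get()]              # block until one request arrives
            try:
                while len(batch) < max_batch:          # then gather a few more, briefly
                    batch.append(request_queue.get(timeout=wait_seconds))
            except queue.Empty:
                pass
            outputs = predict_batch([inp for inp, _ in batch])
            for (_, reply), out in zip(batch, outputs):
                reply.put(out)                         # hand each result back

    threading.Thread(target=batching_worker, daemon=True).start()

    def predict(x):
        """Called from a Flask view: enqueue the input and block for its result."""
        reply = queue.Queue(maxsize=1)
        request_queue.put((x, reply))
        return reply.get()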

Butterandeggs answered 22/8, 2018 at 3:23 Comment(0)

Flask is used to handle requests and responses, whereas TensorFlow Serving is built specifically for serving flexible ML models in production.

Let's take some scenarios where you want to:

  • Serve multiple models to multiple products (a many-to-many relationship) at the same time.
  • See which model is making an impact on your product (A/B testing).
  • Update model weights in production, which is as easy as saving a new model to a folder.
  • Get performance on par with code written in C/C++.

And you can always get all of those advantages for free by sending requests to TF Serving from Flask.
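For example, the Flask layer can be a thin proxy in front of TF Serving's REST endpoint (the host, port, and model name below are just examples):

    # Minimal sketch of the "Flask in front of TF Serving" pattern: the Flask
    # view simply forwards the request to TF Serving's REST endpoint.
    # Host, port, and model name are assumptions.
    import requests
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    TF_SERVING_URL = "http://localhost:8501/v1/models/object_detector:predict"

    @app.route("/detect", methods=["POST"])
    def detect():
        # The client sends JSON such as {"instances": [[...image data...]]}
        resp = requests.post(TF_SERVING_URL, json=request.get_json())
        resp.raise_for_status()
        return jsonify(resp.json())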

Raine answered 23/8, 2018 at 16:8 Comment(2)
Sorry for the delay and thank you for your answer. For point 2, do you mean that you can save time on memory management when switching models? For point 4, when you just run sess.run(...), isn't your performance already pretty close to C/C++? – Rohr
TensorFlow Serving now supports a RESTful API, so you need not use predict_pb2 to make predictions. For point 2, what I mean is testing the new model's performance to answer questions like "How is my new model performing compared to the old one?". For point 4, probably yes, but IMHO finely tuned C++ inference code inside Serving should perform better than running sess.run() inside Flask (when the request rate is high, e.g. 100 requests/second). – Raine
