Tensorflow Serving Performance Very Slow vs Direct Inference
I am running in the following scenario:

  • Single Node Kubernetes Cluster (1x i7-8700K, 1x RTX 2070, 32GB RAM)
  • 1 Tensorflow Serving Pod
  • 4 Inference Client Pods

Each inference client grabs frames from one of 4 separate cameras (1 camera each) and passes them to TF-Serving for inference, in order to understand what is seen on the video feeds.

Previously I was doing inference inside each Inference Client Pod individually by calling TensorFlow directly, but that was hard on the graphics card's RAM because every pod loaded its own copy of the model. TensorFlow Serving was introduced to the mix quite recently to optimize RAM usage, since we no longer load duplicate models onto the graphics card.

And the performance is not looking good; for a 1080p image it looks like this:

  • Direct TF: 20 ms for input tensor creation, 70 ms for inference.
  • TF-Serving: 80 ms for gRPC serialization, 700-800 ms for inference.

The TF-Serving pod is the only one that has access to the GPU and it is bound exclusively. Everything else operates on CPU.
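
For reference, the clients build the Predict request roughly like this (a sketch of the standard tensorflow-serving-api pattern; the model and tensor names are illustrative):

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    # Raise the gRPC message size limits; a 1080p uint8 frame alone is ~6 MB.
    channel = grpc.insecure_channel(
        "tf-serving:8500",
        options=[("grpc.max_send_message_length", 64 * 1024 * 1024),
                 ("grpc.max_receive_message_length", 64 * 1024 * 1024)])
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "faster_rcnn"            # illustrative model name
    request.model_spec.signature_name = "serving_default"

    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # placeholder for a real camera frame
    request.inputs["inputs"].CopyFrom(                   # "inputs" is an illustrative tensor name
        tf.make_tensor_proto(frame[np.newaxis, ...]))    # add the batch dimension

    result = stub.Predict(request, 10.0)                 # 10 s deadline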

Are there any performance tweaks I could do?

The model I'm running is Faster R-CNN Inception V2 from the TF Model Zoo.

Many thanks in advance!

Saxophone answered 2/4, 2020 at 17:17 Comment(1)
Is the comparison CPU (direct) vs GPU (TF-Serving)? If so, your bottleneck could be the data transfer time to the GPU. Often the CPU can perform single-instance inference faster, and whether or not you're using something like MKL is another factor that may help CPU inference. – Brittbritta

This is from the TF Serving documentation:

Please note, while the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.

From my own experience, I've found TF Serving useful for providing a consistent abstraction over model serving that does not require implementing custom serving functionality. Model versioning and multi-model serving, which come out of the box, save you a lot of time and are great additions.

Additionally, I would recommend batching your requests if you haven't already (see the sketch below). I would also suggest playing around with the TENSORFLOW_INTER_OP_PARALLELISM, TENSORFLOW_INTRA_OP_PARALLELISM and OMP_NUM_THREADS arguments to TF Serving. Here is an explanation of what they are.
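
As an illustration of client-side batching, here is a minimal sketch using the standard tensorflow-serving-api client (model and tensor names are placeholders); server-side batching can additionally be enabled on tensorflow_model_server with --enable_batching and a batching parameters file:

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    def predict_batch(frames, host="tf-serving:8500", model_name="faster_rcnn"):
        """Send several frames in one Predict call instead of one call per frame."""
        channel = grpc.insecure_channel(host)
        stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

        request = predict_pb2.PredictRequest()
        request.model_spec.name = model_name
        request.model_spec.signature_name = "serving_default"

        # Stack N HxWx3 uint8 frames into a single [N, H, W, 3] batch tensor.
        batch = np.stack(frames, axis=0)
        request.inputs["inputs"].CopyFrom(tf.make_tensor_proto(batch))

        return stub.Predict(request, 10.0)  # 10 s deadline

Batching lets the GPU amortize per-request overhead across several frames, which usually helps throughput more than per-frame latency.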

Polyurethane answered 6/1, 2021 at 17:55 Comment(0)

Maybe you could try OpenVINO? It's a heavily optimized toolkit for inference. You could utilize your i7-8700K and run several frames in parallel. Here are some performance benchmarks for the very similar i7-8700T.
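
For example, a minimal sketch with the OpenVINO Python runtime, assuming the model has already been converted to OpenVINO IR (the file name, input shape and layout are placeholders that depend on your conversion):

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    # Load a model previously converted to OpenVINO IR (placeholder path).
    model = core.read_model("faster_rcnn.xml")
    compiled = core.compile_model(model, device_name="CPU")
    output_layer = compiled.output(0)

    # Placeholder frame; the shape/layout depend on how the model was converted.
    frame = np.zeros((1, 600, 600, 3), dtype=np.float32)
    detections = compiled([frame])[output_layer]

Several camera streams could then be processed in parallel, e.g. with the runtime's AsyncInferQueue.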

There is even OpenVINO Model Server, which is very similar to TensorFlow Serving.

Disclaimer: I work on OpenVINO.

Amorphism answered 26/5, 2022 at 12:11 Comment(0)
