Can multiple tensorflow inferences run on one GPU in parallel?
I am trying to run TensorFlow as a server on one NVIDIA Tesla V100 GPU. As a server, my program needs to accept multiple requests concurrently. So, my questions are the following:

  1. When multiple requests arrive at the same time (suppose we are not using batching), are these requests run on the GPU sequentially or in parallel? I understand that independent processes have separate CUDA contexts, which are run sequentially on the GPU. But these requests are actually different threads in the same process and should share one CUDA context. So, according to the documentation, the GPU can run multiple kernels concurrently. If this is true, does it mean that if a large number of requests arrive at the same time, the GPU utilization can go up to 100%? This never happens in my experiments.

  2. What is the difference between running one session in different threads vs. running different sessions in different threads? Which is the proper way to implement a TensorFlow server? Which one does TensorFlow Serving use? (A simplified sketch of what I mean by the first option is below.)

Any advice will be appreciated. Thank you!
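For reference, here is a simplified sketch of the setup I have in mind for question 2: one session shared by several request-handling threads (TF 1.x style; the model and shapes are placeholders, not my actual code).

```python
import threading
import numpy as np
import tensorflow as tf  # TF 1.x API, since I am using Session directly

# Placeholder graph standing in for the real model (shapes are made up).
x = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
y = tf.layers.dense(tf.layers.flatten(x), 10)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

def handle_request(batch):
    # Each incoming request is handled by its own thread, but all threads
    # share the same Session (and therefore the same CUDA context).
    return sess.run(y, feed_dict={x: batch})

threads = [threading.Thread(target=handle_request,
                            args=(np.random.rand(1, 224, 224, 3).astype(np.float32),))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```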

Fisk answered 29/4, 2019 at 16:19

Regarding #1: all requests will be run on the same GPU sequentially, since TF uses a single global compute stream for each physical GPU device (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L284)

Regarding #2: in terms of multi-streaming, the two options are similar; by default, multi-streaming is not enabled. If you want to experiment with multiple streams, you may try the virtual_device option (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L138)
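For illustration, here is a minimal sketch of splitting one physical GPU into two logical devices using the TF 2.x Python API, which is the higher-level counterpart of the virtual_devices field linked above; the memory limits are arbitrary example values, not a recommendation.

```python
import tensorflow as tf  # recent TF 2.x

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Split the single physical GPU into two logical devices.
    # Memory limits are in MB and are arbitrary example values.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096),
         tf.config.LogicalDeviceConfiguration(memory_limit=4096)])

# Each logical device can then be targeted explicitly, e.g. with tf.device('/GPU:1').
print(tf.config.list_logical_devices('GPU'))
```

Whether each logical device actually gives you a separate compute stream depends on TF internals, so treat this as an experiment rather than a guarantee.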

Thanks.

Beckon answered 26/8, 2019 at 19:4 Comment(1)
I'm a bit unclear as to what you mean by "multi-streaming" here. Do you mean that each logical device will get its own stream? This appears to contradict your answer to part #1, if I understand correctly, where you say "TF uses a single global compute stream for each physical GPU device". – Conjoin

For model inference, you may want to look at high-performance inference engines like NVIDIA Triton. It allows multiple instances of a model, each with its own dedicated CUDA streams, so the GPU can exploit more parallelism.

See https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/architecture.html#concurrent-model-execution
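As a concrete illustration, here is a hedged sketch of a client sending concurrent requests to a Triton server with the tritonclient Python package. The model name and input/output tensor names are hypothetical; the server-side concurrency itself is enabled in the model's configuration (see the instance_group discussion in the linked docs).

```python
import threading
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

MODEL_NAME = "my_model"  # hypothetical model name; adjust to your deployment

def send_request():
    # One client per thread to keep the sketch simple and thread-safe.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")  # hypothetical tensor name
    inp.set_data_from_numpy(data)
    out = httpclient.InferRequestedOutput("OUTPUT__0")  # hypothetical tensor name
    result = client.infer(MODEL_NAME, inputs=[inp], outputs=[out])
    return result.as_numpy("OUTPUT__0")

# If the model config allows more than one instance, Triton can serve these
# concurrent requests on separate model instances / CUDA streams.
threads = [threading.Thread(target=send_request) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```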

Marilla answered 20/8, 2020 at 0:36
