Problem description
The problem we encounter is the following: TensorFlow Serving is configured to load and serve 7 models, and as the number of loaded models grows, requests to Serving time out more frequently. Conversely, with fewer models loaded, timeouts become negligible. On the client side, the request timeout is configured to 5 seconds (a sketch of the client call is included below).
Interestingly, the maximum batch processing duration is approximately 700ms, with a configured maximum batch size of 10. The average batch processing duration is ~60ms.
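For reference, this is roughly how our clients issue requests with the 5-second deadline; the host, port, model name, signature, and input key below are placeholders for illustration, not our actual values:

```python
import time

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Placeholder address of one of the Serving instances.
channel = grpc.insecure_channel("serving-host:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

def predict(model_name, batch):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"  # placeholder signature
    request.inputs["input"].CopyFrom(tf.make_tensor_proto(batch))  # placeholder input key
    start = time.time()
    # 5-second client-side deadline; requests exceeding it surface as DEADLINE_EXCEEDED.
    response = stub.Predict(request, timeout=5.0)
    print("%s answered in %.3fs" % (model_name, time.time() - start))
    return response
```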
Logs and screenshots
We've checked the TensorFlow Serving logs, but no warnings or errors were found. In addition, we monitored the network on both the GPU machines running Serving and the hosts issuing inference requests towards Serving, and no network issues were identified either.
Temporary solution
Decreasing the number of loaded and served models reduces the timeouts. However, this is not the solution we are looking for, because it requires setting up multiple distinct GPU instances, each loading and serving only a subset of the models (see the client-side routing sketch below).
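To make that workaround concrete, here is a minimal sketch of how a client could route each model to the Serving instance hosting it; the hostnames, ports, and model names are assumptions for illustration, not our actual deployment:

```python
import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Hypothetical mapping from model name to the Serving instance that hosts it.
MODEL_TO_HOST = {
    "model_a": "gpu-instance-1:8500",
    "model_b": "gpu-instance-1:8500",
    "model_c": "gpu-instance-2:8500",
}

_stubs = {}

def _stub_for(host):
    # Cache one gRPC channel/stub per Serving instance.
    if host not in _stubs:
        _stubs[host] = prediction_service_pb2_grpc.PredictionServiceStub(
            grpc.insecure_channel(host))
    return _stubs[host]

def predict(model_name, tensor_proto, input_key="input"):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs[input_key].CopyFrom(tensor_proto)
    # Same 5-second client-side deadline as before.
    return _stub_for(MODEL_TO_HOST[model_name]).Predict(request, timeout=5.0)
```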
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow Serving installed from (source or binary): source
TensorFlow Serving version: 1.9
TensorFlow Serving runs on multiple AWS g2.2xlarge instances. We run it inside Docker, using nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 as the base image.
What could be the root cause of this behaviour? How is Serving expected to handle requests when multiple models are loaded in memory? How does it switch between model contexts?