Very slow performance on extracting tensorflow-serving grpc request results using .float_val
Asked Answered
S

0

6

For some reason the time used to extract results using .float_val is extremely high.

Scenario example along with its output:

t2 = time.time()
options = [('grpc.max_receive_message_length', 100 * 4000 * 4000)]
channel = grpc.insecure_channel('{host}:{port}'.format(host='localhost', port=str(self.serving_grpc_port)), options = options)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'ivi-detector'
request.model_spec.signature_name = 'serving_default'

request.inputs['inputs'].CopyFrom(tf.make_tensor_proto(imgs_array, shape=imgs_array.shape))
res = stub.Predict(request, 100.0)

print("Time to detect:")
t3 = time.time(); print("t3:", t3 - t2)

t11 = time.time()
boxes_float_val = res.outputs['detection_boxes'].float_val
t12 = time.time(); print("t12:", t12 - t11)
classes_float_val = res.outputs['detection_classes'].float_val
t13 = time.time(); print("t13:", t13 - t12)
scores_float_val = res.outputs['detection_scores'].float_val
t14 = time.time(); print("t14:", t14 - t13)

boxes = np.reshape(boxes_float_val, [len(imgs_array), self.max_total_detections,4])
classes = np.reshape(classes_float_val, [len(imgs_array), self.max_total_detections])
scores = np.reshape(scores_float_val, [len(imgs_array), self.max_total_detections])
t15 = time.time(); print("t15:", t15 - t14)
Time to detect:
t3: 1.4687104225158691
t12: 1.9140026569366455
t13: 3.719329833984375e-05
t14: 9.298324584960938e-06
t15: 0.0008063316345214844

Tensorflow Serving is running an object detection model from tensorflow's object detection api (faster_rncc_resnet101). As we can see, the extraction of the boxes found on detection is higher than the prediction itself.

The current shape of the detected boxes is [batch_size, 100, 4], with 100 being the number of max detections. As a workaround I can low the number of max detection and decrease significantly the necessary time to extract these values, but it keeps staying unnecessary (on my point of view) high.

I'm using tensorflow-serving 2.3.0-gpu as a docker container along with tensorflow-serving-api==2.3.0

Also, it's important to inform that I tried to reproduce this behaviour on a public saved model (purely trained on imagenet) and the slow performance on .float_val didn't happen, pointing that the problem can be specifically with my custom trained model. I already tried to export the saved model from .ckpt files in different ways but the problem still occurs and, if I use any of the export methods for the downloaded model (the downloaded model comes with both .ckpt files and saved_model format files) the problem doesn't occur, so the export methods are safe.

Now I'm suspecting that something is wrong/different with the model I trained.... but.. why? Does it makes sense that it affects .float_val from tensorflow-serving-api?

The code I used (with fast results): https://github.com/denisb411/tfserving-od/blob/master/inference-using-tfserving-docker.ipynb

I don't know how to proceed as my custom training follows almost the same pipeline.config as the original so, there's nothing different on the training process.

How can I manage to fix this? How is this related with .float_val, if there's any relation?

Assuming that this is a bug, a time ago I created a github issue talking about this problem I got into but it didn't get enough attention.

Sparid answered 20/10, 2020 at 1:21 Comment(7)
I do not know this specific API that you're using, but just looking at your problem, isn't is possible that you see the slowdown because of some lazy initialization/computation? Either there is some unreasonably large property of res (maybe inspect in debugger?) or perhaps the model predict is actually only called once you request it's first return value. Also, with local TF use, I always see a slowdown when asking for first prediction after initialization. To prove me wrong, you can simply change the order of calling the float_val values and confirm that the timings are still the same.Geranium
@Geranium Indeed the problem wasn't particularly with the boxes variable but with any first .float_val that I call. Reproduced this. Still the question: Why the public model works normally (fast unpack of predicted values) but not my custom one? How the number of max detections affects this? Because we know it affects as lowering this number the predictions are unpacked faster.Sparid
I would inspect two things. 1. If you run a second prediction right after the first one in a single script, do you see similar timings on both? If not, this would suggest that it does not really have to do with the returned data (I know this is consistent with your scaling but still, I'd test this). 2. If you confirm it's the data, I would inspect the result object in debugger to see if there are any properties that are huge. Because then the delay could perhaphs be caused by network bandwidth bottleneck. What is the shape of your outputs by the way? (Those could possibly be the culprit)Geranium
I reproduced the second predict request (without instantiating a new predict_pb2.PredictRequest()) and the exactly same problem occurs. How do you suggest I debug? And what output are you referring? The response from prediction or the data I want to extract (ie. boxes, classes and scores)?Sparid
I am referring to the shape of the outputs of the model which you're using for inference. Well, I suggest you use a debugger, like the one in pycharm, you can set a breakpoint before the end of the script and you can inspect the res.outputs after the first float_val call to see what are the contents of the object, looking for anything that consumes a lot of memory. Do you know the bandwidth of the server connection? Also have you tried different values for 'grpc.max_receive_message_length'? Is it possible it is too small/too large?Geranium
In this particular case the output of boxes is [10, 16, 4], classes is [10, 16], same as scores. The ´grpc.max_receive_message_length´ option value doesn't affect the root problem, nor using an different shape for the input (resized the images from 1920x1080 to 960x540).. the problem persisted. I managed to reproduce the fault on vscode debugger but I don't have experience with this tool thus I don't know what to check/report. Can you help me with this?Sparid
As I already mentioned, I'd look for properties of the result object that might be too large, you don't need a debugger for that, it just makes it easier. How large is the model you're using? Is it comparable to the public one you're mentioning? Are you running the model on a GPU? I'd still be suspicious that something asynchronous/lazy is going on and only once you call float_val the thread is blocked and waits for the server. Is it possible that running the inference on the server takes this long? Other than that I'm out of my depth, the problem might also be in some other part of your code.Geranium

© 2022 - 2025 — McMap. All rights reserved.