Understanding tf.contrib.lite.TFLiteConverter quantization parameters

I'm trying to use UINT8 quantization while converting a TensorFlow model to a TFLite model:

If I use post_training_quantize = True, the model size is 4x smaller than the original fp32 model, so I assume the model weights are uint8, but when I load the model and get the input type via interpreter_aligner.get_input_details()[0]['dtype'] it's float32. The outputs of the quantized model are about the same as the original model's.

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file='tflite-models/tf_model.pb',
        input_arrays=input_node_names,
        output_arrays=output_node_names)
converter.post_training_quantize = True
tflite_model = converter.convert()

Input/output of the converted model:

print(interpreter_aligner.get_input_details())
print(interpreter_aligner.get_output_details())
[{'name': 'input_1_1', 'index': 47, 'shape': array([  1, 128, 128,   3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([  1, 156], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]

Another option is to specify more parameters explicitly: the model size is 4x smaller than the original fp32 model and the model input type is uint8, but the model outputs are more like garbage.

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file='tflite-models/tf_model.pb',
        input_arrays=input_node_names,
        output_arrays=output_node_names)
converter.post_training_quantize = True
converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {input_node_names[0]: (0.0, 255.0)}  # (mean, stddev)
converter.default_ranges_stats = (-100, +100)
tflite_model = converter.convert()

Input/output of the converted model:

[{'name': 'input_1_1', 'index': 47, 'shape': array([  1, 128, 128,   3], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.003921568859368563, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([  1, 156], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.7843137383460999, 128)}]

So my questions are:

  1. What is happening when only post_training_quantize = True is set? I.e., why does the 1st case work fine, but the 2nd doesn't?
  2. How do I estimate the mean, std and range parameters for the 2nd case?
  3. It looks like in the 2nd case model inference is faster; does that depend on the model input being uint8?
  4. What does 'quantization': (0.0, 0) mean in the 1st case, and 'quantization': (0.003921568859368563, 0), 'quantization': (0.7843137383460999, 128) in the 2nd case?
  5. What is converter.default_ranges_stats?

Update:

The answer to question 4 was found here: What does 'quantization' mean in interpreter.get_input_details()?

Folks answered 22/2, 2019 at 16:0 Comment(2)
@suharshs It looks like you work on this part of TensorFlow; can you elaborate on this? - Folks
4a. quantization is ignored for a dtype of float32 - Merkle

What is happening when only post_training_quantize = True is set? I.e., why does the 1st case work fine, but the 2nd doesn't?

In TF 1.14, this seems to just quantize the weights stored on disk, in the .tflite file. This does not, by itself, set the inference mode to quantized inference.

I.e., you can have a tflite model whose inference type is float32 but whose weights are quantized (via post_training_quantize=True), for the sake of lower disk size and faster loading of the model at runtime.
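
If you want to verify this, you can list the dtype of every tensor in the converted model: with weight-only quantization you should see uint8 weight buffers even though the input and output tensors stay float32. A minimal sketch (the model path is illustrative, and on older 1.x builds the interpreter may live under tf.contrib.lite.Interpreter instead of tf.lite.Interpreter):

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='tflite-models/tf_model.tflite')
interpreter.allocate_tensors()

# Print dtype and quantization params of every tensor in the graph;
# weight tensors should show uint8 after post_training_quantize=True.
for t in interpreter.get_tensor_details():
    print(t['name'], t['dtype'], t['quantization'])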

How do I estimate the mean, std and range parameters for the 2nd case?

The documentation is confusing to many. Let me explain what I concluded after some research:

  1. Unfortunately, the quantization parameters/stats have 3 equivalent forms/representations across the TF library and documentation:
    • A) (mean, std_dev)
    • B) (zero_point, scale)
    • C) (min, max)
  2. Conversion from B) to A):
    • std_dev = 1.0 / scale
    • mean = zero_point
  3. Conversion from C) to A):
    • mean = 255.0 * min / (min - max)
    • std_dev = 255.0 / (max - min)
    • Explanation: the quantization stats are the parameters used for mapping between the integer range (0, 255) and an arbitrary real range (min, max). You can start from the two equations min * std_dev + mean = 0 and max * std_dev + mean = 255, then follow the math to reach the above conversion formulas.
  4. Conversion from A) to C):
    • min = -mean / std_dev
    • max = (255 - mean) / std_dev
    • Equivalently, in form B): min = -zero_point * scale, max = (255 - zero_point) * scale
  5. The names "mean" and "std_dev" are confusing and are widely seen as misnomers (a small code sketch of these conversions follows the examples below).

To answer your question: if your input image has

  • range (0,255) then mean = 0, std_dev = 1
  • range (-1,1) then mean = 127.5, std_dev = 127.5
  • range (0,1) then mean = 0, std_dev = 255
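
To make these conversions concrete, here is a small sketch (the helper names are my own, not a TF API) that reproduces the examples above:

def min_max_to_mean_std(min_val, max_val):
    """(min, max) -> (mean, std_dev), using std_dev = 255 / (max - min)."""
    std_dev = 255.0 / (max_val - min_val)
    mean = 255.0 * min_val / (min_val - max_val)  # equals -min * std_dev
    return mean, std_dev

def mean_std_to_scale_zero_point(mean, std_dev):
    """(mean, std_dev) -> (scale, zero_point); zero_point is stored as an integer in the .tflite file."""
    return 1.0 / std_dev, mean

print(min_max_to_mean_std(0.0, 255.0))  # mean 0,     std_dev 1
print(min_max_to_mean_std(-1.0, 1.0))   # mean 127.5, std_dev 127.5
print(min_max_to_mean_std(0.0, 1.0))    # mean 0,     std_dev 255
print(mean_std_to_scale_zero_point(0.0, 255.0))  # scale 1/255 ~ 0.00392, zero_point 0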

It looks like in the 2nd case model inference is faster; does that depend on the model input being uint8?

Yes, possibly. However, quantized models are typically slower unless you make use of the vectorized instructions of your specific hardware. TFLite is optimized to run those specialized instructions on ARM processors. As of TF 1.14 or 1.15, if you are running this on your local x86 machine (Intel or AMD), then I'd be surprised if the quantized model runs faster. [Update: it's on TFLite's roadmap to add first-class support for x86 vectorized instructions to make quantized inference faster than float.]

What does 'quantization': (0.0, 0) mean in the 1st case, and 'quantization': (0.003921568859368563, 0), 'quantization': (0.7843137383460999, 128) in the 2nd case?

Here the format is quantization: (scale, zero_point).

In your first case, you only activated post_training_quantize=True, which doesn't make the model run quantized inference, so there is no need to transform the inputs or the outputs from float to uint8. Thus the quantization stats here are essentially null, which is represented as (0, 0).

In the second case, you activated quantized inference by providing inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8. So you have quantization parameters for both input and output, which are needed to transform your float input to uint8 on the way into the model, and the uint8 output back to a float output on the way out.

  • At the input, do the transformation: uint8_array = float_array * std_dev + mean (equivalently, float_array / scale + zero_point)
  • At the output, do the transformation: float_array = (uint8_array.astype(np.float32) - mean) / std_dev (equivalently, (uint8_array - zero_point) * scale)
  • Note that the .astype(np.float32) is necessary in Python to get a correct calculation (otherwise the uint8 arithmetic wraps around).
  • Note that other texts may use scale instead of std_dev, so the divisions become multiplications and vice versa.

Another confusing thing here is that, even though during conversion you specify quantized_input_stats = (mean, std_dev), get_output_details will return quantization: (scale, zero_point); not only is the form different (scale vs std_dev), but the order is different too!

Now, to understand the quantization parameter values you got for the input and output, let's use the formulas above to deduce the range of real values ((min, max)) of your inputs and outputs. Using the above formulas we get:

  • Input range: min = 0, max = 1 (you specified this yourself by providing quantized_input_stats = {input_node_names[0]: (0.0, 255.0)} # (mean, stddev))
  • Output range: min = -100.39, max = 99.6
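
Putting this together, here is a rough sketch of running your second (uint8) model from Python, reading (scale, zero_point) from the interpreter and doing the float <-> uint8 transforms by hand (the model path and the random [0, 1] image are placeholders for your own data):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='tflite-models/tf_model_uint8.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

float_image = np.random.rand(1, 128, 128, 3).astype(np.float32)  # stands in for your [0, 1] image

# Quantize the input: uint8 = real / scale + zero_point  (here: float * 255)
scale, zero_point = inp['quantization']
uint8_image = np.clip(np.round(float_image / scale + zero_point), 0, 255).astype(np.uint8)
interpreter.set_tensor(inp['index'], uint8_image)
interpreter.invoke()

# Dequantize the output: real = (uint8 - zero_point) * scale
scale, zero_point = out['quantization']
uint8_out = interpreter.get_tensor(out['index'])
float_out = (uint8_out.astype(np.float32) - zero_point) * scale
print(float_out.shape, float_out.min(), float_out.max())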
Levalloisian answered 25/9, 2019 at 10:33 Comment(10)
Does converter.default_ranges_stats correspond to the min, max in your answer? - Folks
Also, following your equations, if the input image range is [0, 1] then mean = 0, std_dev = 255, and min = -mean / std_dev, max = (255 - mean) / std_dev -> min = 0, max = 1, so where do min = -100.39, max = 99.6 come from? - Folks
Yes, default_ranges_stats specifies (min, max). - Levalloisian
For your output you have get_output_details() returning 'quantization': (0.7843137383460999, 128), so min = -128 * 0.7843137 = -100.39, max = (255 - 128) * 0.7843137 = 99.6. Let me know if you spot a mistake. - Levalloisian
@Folks also remember to accept the best answer. - Levalloisian
I still don't get it. Does default_ranges_stats influence the values shown for the output layer, 'quantization': (0.7843137383460999, 128), i.e. does it mean that the values min = -128 * 0.7843137 = -100.39, max = (255 - 128) * 0.7843137 = 99.6 are close to (-100, +100)? Based on what should default_ranges_stats be set? Should I estimate all possible outputs of the network on my dataset and set the default_ranges_stats range based on that? - Folks
Yes, default_ranges_stats influences the quantization params of not just the output, but of every tensor in your graph. Note that you should never define default_ranges_stats yourself; this option is available only for debugging. The right way is to quantization-aware-train your model, so that each tensor gets its own min/max based on the training data: github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/… - Levalloisian
To be precise, it does not just "influence" them... default_ranges_stats IS the set of quantization parameters, specified in (min, max) format, then stored in the .tflite file in the (scale, zero_point) format, which is what you see in get_output_details(). - Levalloisian
So with post_training_quantize = True (without the other params) I get a model with uint8 weights and float32 inference, but the only way to get a model with uint8 weights and uint8 inference is quantization-aware training? Is there any reason to specify additional converter params when post_training_quantize = True is used? - Folks
Let us continue this discussion in chat. - Levalloisian

1) See the documentation. In short, this technique lets you get a graph with quantized uint8 weights whose accuracy is close to the original model's, and it does not require further training of the quantized model. However, the speed is noticeably lower than if conventional (full) quantization were used.

2) If your model was trained with input normalized to [-1.0, 1.0], you should set converter.quantized_input_stats = {input_node_names[0]: (128, 127)}, and the quantization of the input tensor will then be close to (1/127 ≈ 0.0079, 128). mean is the integer value from 0 to 255 that maps to floating point 0.0f. std_dev is 255 / (float_max - float_min). This will fix one possible problem (see the sketch after item 3).

3) Uint8 neural-network inference is roughly 2x faster than float32 inference (depending on the device).
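
As a sketch of point 2 (the graph path and node names are taken from the question; the exact values are illustrative, not a guaranteed-working recipe), the converter settings for a model trained with input normalized to [-1.0, 1.0] would look roughly like this:

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file='tflite-models/tf_model.pb',
        input_arrays=input_node_names,
        output_arrays=output_node_names)
converter.post_training_quantize = True
converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
# mean = 128 is the uint8 value that maps to 0.0f;
# std_dev = 255 / (1.0 - (-1.0)) = 127.5 (rounded to 127 in point 2 above)
converter.quantized_input_stats = {input_node_names[0]: (128, 127.5)}
# Only a fallback for tensors that lack their own min/max info (see the comments);
# quantization-aware training is the proper fix.
converter.default_ranges_stats = (-100, +100)
tflite_model = converter.convert()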

Attention answered 28/2, 2019 at 4:14 Comment(3)
For 2), did you mean image preprocessing? I use a BGR image divided by 255.0 as input, so my input is in the range [0, 1]; as I understand it, in my case that means mean = 0.0 and std_dev = 255.0. What about converter.default_ranges_stats? - Folks
Yes, it depends on the image preprocessing. About default_ranges_stats: in general, to create a quantized tflite graph, every tensor has to have min/max information for its possible values. This information is used to create the quantization parameters scale and zero_point. If those min/max values are missing, the min/max from default_ranges_stats is used instead, and in that case the quantized graph's inference will be garbage-like. - Attention
Is there some way to see that a TFLite model did pass post_training_quantize? In my test, running //tensorflow/lite/tools:visualize on both models gives the same results (not identical, the buffer indices are different). Also, the time cost of running inference for the two models (on CPU) does not change statistically. - Merkle
