I'm trying to use UINT8 quantization while converting tensorflow model to tflite model:
If use post_training_quantize = True
, model size is x4 lower then original fp32 model, so I assume that model weights are uint8, but when I load model and get input type via interpreter_aligner.get_input_details()[0]['dtype']
it's float32. Outputs of the quantized model are about the same as original model.
converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
graph_def_file='tflite-models/tf_model.pb',
input_arrays=input_node_names,
output_arrays=output_node_names)
converter.post_training_quantize = True
tflite_model = converter.convert()
Input/output of converted model:
print(interpreter_aligner.get_input_details())
print(interpreter_aligner.get_output_details())
[{'name': 'input_1_1', 'index': 47, 'shape': array([ 1, 128, 128, 3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([ 1, 156], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
Another option is to specify more parameters explicitly: Model size is x4 lower then original fp32 model, model input type is uint8, but model outputs are more like garbage.
converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
graph_def_file='tflite-models/tf_model.pb',
input_arrays=input_node_names,
output_arrays=output_node_names)
converter.post_training_quantize = True
converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {input_node_names[0]: (0.0, 255.0)} # (mean, stddev)
converter.default_ranges_stats = (-100, +100)
tflite_model = converter.convert()
Input/output of converted model:
[{'name': 'input_1_1', 'index': 47, 'shape': array([ 1, 128, 128, 3], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.003921568859368563, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([ 1, 156], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.7843137383460999, 128)}]
So my questions are:
- What is happenning when only
post_training_quantize = True
is set? i.e. why 1st case work fine, but second don't. - How to estimate mean, std and range parameters for second case?
- Looks like in second case model inference is faster, is it depend on the fact that model input is uint8?
- What means
'quantization': (0.0, 0)
in 1st case and'quantization': (0.003921568859368563, 0)
,'quantization': (0.7843137383460999, 128)
in 2nd case? - What is
converter.default_ranges_stats
?
Update:
Answer to question 4 is found What does 'quantization' mean in interpreter.get_input_details()?