I am confused about whether it is possible to run an int8 quantized model on CUDA, or whether you can only calibrate a quantized model on CUDA using fake quantization, for deployment on another backend such as a CPU.
I want to run the model on CUDA with actual int8 instructions, instead of fake-quantized float32 instructions, and enjoy the efficiency gains. The PyTorch docs are strangely nonspecific about this. If it is possible to run a quantized model on CUDA with a different framework, such as TensorFlow, I would love to know.
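As I understand it, fake quantization keeps every tensor in float32 but rounds values onto the int8 grid, which is why it runs on any device without real int8 kernels. A toy illustration of the distinction (my own sketch, not from my script; the scale and zero point are arbitrary):

import torch

x = torch.randn(4, device='cuda:0')
# scale=0.1, zero_point=0, quant range [-128, 127]
x_fq = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, -128, 127)
print(x.dtype, x_fq.dtype)  # both torch.float32: int8 numerics are simulated, no int8 instructions run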
This is the code I use to prepare my quantized model (using post-training quantization). The model is a normal CNN with nn.Conv2d, nn.LeakyReLU, and nn.MaxPool2d modules:
import copy

import torch
from torch.quantization import quantize_fx
from torch.utils.data import DataLoader

model_fp = torch.load(models_dir + net_file)
model_to_quant = copy.deepcopy(model_fp)
model_to_quant.eval()

# Fuse conv/activation patterns, then insert observers for calibration
model_to_quant = quantize_fx.fuse_fx(model_to_quant)
qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}
model_prepped = quantize_fx.prepare_fx(model_to_quant, qconfig_dict)
model_prepped.eval()
model_prepped.to(device='cuda:0')

# Run a couple of batches through the prepared model so the observers
# can record activation ranges
train_data = ImageDataset(img_dir, train_data_csv, 'cuda:0')
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, pin_memory=True)
for i, (input, _) in enumerate(train_loader):
    if i > 1:
        break
    print('batch', i + 1, end='\r')
    input = input.to('cuda:0')
    model_prepped(input)
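To sanity-check the calibration, I can list the observers that FX inserted; each should have recorded min/max activation values (the exact observer names depend on the PyTorch version, so treat this as an optional check):

for name, module in model_prepped.named_modules():
    if 'activation_post_process' in name:
        print(name, module)  # each observer holds the min_val / max_val seen during calibration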
This actually quantizes the model:
model_quantised = quantize_fx.convert_fx(model_prepped)
model_quantised.eval()
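Printing the converted GraphModule (just a quick check on my end, not part of the original flow) shows quantized modules and quantize/dequantize nodes, so the conversion itself appears to succeed:

print(model_quantised)        # should show quantized conv modules
print(model_quantised.graph)  # FX graph with quantize_per_tensor / dequantize nodes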
This is an attempt to run the quantized model on CUDA; it raises a NotImplementedError. When I run it on the CPU, it works fine:
model_quantised = model_quantised.to('cuda:0')
for input, _ in train_loader:
    input = input.to('cuda:0')
    out = model_quantised(input)
    print(out, out.shape)
    break
This is the error:
Traceback (most recent call last):
  File "/home/adam/Desktop/thesis/Ship Detector/quantisation.py", line 54, in <module>
    out = model_quantised(input)
  File "/home/adam/.local/lib/python3.9/site-packages/torch/fx/graph_module.py", line 513, in wrapped_call
    raise e.with_traceback(None)
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'QuantizedCUDA' backend.
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build).
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions.
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].
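For reference, this is the CPU path that does work for me; it is the same loop with the model and inputs moved to the CPU:

model_quantised = model_quantised.to('cpu')
for input, _ in train_loader:
    out = model_quantised(input.to('cpu'))
    print(out, out.shape)
    break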