Running Pytorch Quantized Model on CUDA GPU

I am confused about whether it is possible to run an int8 quantized model on CUDA, or whether you can only train a quantized model on CUDA with fake quantisation for deployment on another backend such as a CPU.

I want to run the model on CUDA with actual int8 instructions instead of fake-quantised float32 instructions, and enjoy the efficiency gains. The PyTorch docs are strangely nonspecific about this. If it is possible to run a quantized model on CUDA with a different framework such as TensorFlow, I would love to know.

This is the code to prep my quantized model (using post-training quantization). The model is a normal CNN built from nn.Conv2d, nn.LeakyReLU and nn.MaxPool2d modules:

import copy
import torch
from torch.quantization import quantize_fx
from torch.utils.data import DataLoader

model_fp = torch.load(models_dir + net_file)

# work on a copy so the original float model is untouched
model_to_quant = copy.deepcopy(model_fp)
model_to_quant.eval()
model_to_quant = quantize_fx.fuse_fx(model_to_quant)

qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}

# insert observers that record activation ranges during calibration
model_prepped = quantize_fx.prepare_fx(model_to_quant, qconfig_dict)
model_prepped.eval()
model_prepped.to(device='cuda:0')

train_data   = ImageDataset(img_dir, train_data_csv, 'cuda:0')
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, pin_memory=True)

# calibrate on a couple of batches so the observers record ranges
for i, (input, _) in enumerate(train_loader):
    if i > 1: break
    print('batch', i+1, end='\r')
    input = input.to('cuda:0')
    model_prepped(input)

This actually quantizes the model:

# swap the observed modules for real int8 quantized ones
model_quantised = quantize_fx.convert_fx(model_prepped)
model_quantised.eval()
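
Printing the converted GraphModule is a quick way to confirm the conversion worked; the generated forward should now reference quantized modules such as QuantizedConv2d rather than the float originals:

# expect QuantizedConv2d / quantized ops in the printed module
print(model_quantised)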

This is an attempt to run the quantized model on CUDA. It raises a NotImplementedError; when I run it on the CPU it works fine:

model_quantised = model_quantised.to('cuda:0')
for input, _ in train_loader:
    input = input.to('cuda:0')
    out = model_quantised(input)
    print(out, out.shape)
    break

This is the error:

Traceback (most recent call last):
  File "/home/adam/Desktop/thesis/Ship Detector/quantisation.py", line 54, in <module>
    out = model_quantised(input)
  File "/home/adam/.local/lib/python3.9/site-packages/torch/fx/graph_module.py", line 513, in wrapped_call
    raise e.with_traceback(None)
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'QuantizedCUDA' backend. 
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). 
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].
Dormer answered 26/10, 2021 at 6:30

From [this][1] blog, it looks like you cannot run quantized models on GPU.

Quantization in PyTorch is currently CPU-only. Quantization is not a CPU-specific technique (e.g. NVIDIA's TensorRT can be used to implement quantization on GPU). However, inference time on GPU is already usually "fast enough", and CPUs are more attractive for large-scale model server deployment (due to complex cost factors that are out of the scope of this article). Consequently, as of PyTorch 1.6, only CPU backends are available in the native API.

[1]: https://spell.ml/blog/pytorch-quantization-X8e7wBAAACIAHPhT#:~:text=Quantization%20in%20PyTorch%20is%20currently,to%20implement%20quantization%20on%20GPU).
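
In practice, then, the workaround for the code in the question is to keep the converted model and its inputs on the CPU. A minimal sketch, reusing the asker's model_quantised and train_loader:

import torch

# the converted int8 model dispatches to QuantizedCPU kernels
model_quantised = model_quantised.to('cpu')
model_quantised.eval()

with torch.no_grad():
    for input, _ in train_loader:
        out = model_quantised(input.to('cpu'))
        print(out.shape)
        break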

Aragon answered 23/9, 2022 at 7:7

This is an outdated question, but things have progressed quite a bit, and there have been significant strides toward running performant quantized models on a GPU backend.
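
For example, one route is to leave the model in float and let NVIDIA's TensorRT perform int8 post-training quantization for GPU execution via Torch-TensorRT. A rough sketch against the Torch-TensorRT 1.x PTQ API, reusing model_fp and train_loader from the question (the input shape is a placeholder, and the calibrator API has been reworked in newer releases, so check the current docs):

import torch
import torch_tensorrt

# feed calibration batches from an existing DataLoader to TensorRT
calibrator = torch_tensorrt.ptq.DataLoaderCalibrator(
    train_loader,
    cache_file='./calibration.cache',
    use_cache=False,
    algo_type=torch_tensorrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device('cuda:0'),
)

trt_model = torch_tensorrt.compile(
    model_fp.eval().cuda(),                            # original float model
    inputs=[torch_tensorrt.Input((1, 3, 512, 512))],   # placeholder shape
    enabled_precisions={torch.int8},                   # allow int8 kernels
    calibrator=calibrator,
)

batch, _ = next(iter(train_loader))
out = trt_model(batch.cuda())  # real int8 execution on the GPU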

Ricardoricca answered 18/5 at 12:6