Pruning a model doesn't improve inference speed or reduce model size

Asked 11/6, 2020 at 14:27 Answered 20/7, 2023 at 4:37

python machine-learning pytorch torchvision torchtext

I'm trying to prune my model in PyTorch with torch.nn.utils.prune, which provides two tensors:

one is the original weight, and
the other is a mask containing 0s and 1s that closes certain connections in the network.

I have tried both of the following approaches, but neither improves the inference speed:

Run inference with the network right after pruning, which first closes some connections with the mask and then runs the forward pass.
Zero out the original weights with the mask, then remove the mask from the state_dict before running inference.

Is there a way to improve the speed using the model tensor and the mask? Isn't multiplying a non-zero float by 0 faster than multiplying two floats with each other?
Here is my prune function and the procedure I use to measure inference speed:

import copy

import torch
from torch import nn
import torch.nn.utils.prune as prune


def prune_net(net):
    """Globally prune the 20% of weights and biases with the smallest abs(value).

    Function that will be used when a given iteration is reached.

    Args:
        net (nn.Module): the network to prune

    Return:
        newnet (nn.Module): a network containing the masks that prune its weights
    """
    if not isinstance(net, nn.Module):
        print('Invalid input. Must be nn.Module')
        return
    # NOTE: copy.copy is shallow, so newnet shares its submodules with net and
    # pruning newnet also prunes net. Pass in a deep copy to keep the original.
    newnet = copy.copy(net)
    modules_list = []

    for name, module in newnet.named_modules():
        # Assumes every Conv2d/Linear layer was created with a bias
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            modules_list += [(module, 'weight'), (module, 'bias')]

    prune.global_unstructured(
        modules_list,
        pruning_method=prune.L1Unstructured,
        amount=0.2,)
    return newnet

Test inference speed 1st case:

import copy
import time

import torch
from torch import nn
import torch.nn.utils.prune as prune
import torch.nn.functional as F


torch.set_default_tensor_type('torch.cuda.FloatTensor')
old_net = init_your_net()

# Prune a deep copy so that old_net stays unpruned for the comparison
new_net = prune_net(copy.deepcopy(old_net))
new_net = prune_net(new_net)

old_net.eval()
new_net.eval()

old_net = old_net.cuda()
new_net = new_net.cuda()
dataset = load_your_dataset()

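time_old, time_new = 0.0, 0.0

with torch.no_grad():
    for i in range(100):
        x = dataset[i]
        x = x.cuda()
        y = x.cuda()

        # new infer (synchronize so the timer covers the GPU work itself)
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        detections = new_net(x)
        torch.cuda.synchronize()
        time_new += time.perf_counter() - start_time

        # old infer
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        detections = old_net(y)
        torch.cuda.synchronize()
        time_old += time.perf_counter() - start_time
print('old ', time_old)
print('new ', time_new)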

Test inference speed 2nd case:

import copy
import time

import torch
from torch import nn
import torch.nn.utils.prune as prune
import torch.nn.functional as F


torch.set_default_tensor_type('torch.cuda.FloatTensor')
old_net = init_your_net()

# Prune a deep copy so that old_net stays unpruned for the comparison
new_net = prune_net(copy.deepcopy(old_net))
new_net = prune_net(new_net)
# Apply mask to model tensor and remove mask from state_dict
for name, module in new_net.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.remove(module,'weight')
        prune.remove(module,'bias')
    if isinstance(module, torch.nn.Linear):
        prune.remove(module,'weight')
        prune.remove(module,'bias')

old_net.eval()
new_net.eval()

old_net = old_net.cuda()
new_net = new_net.cuda()
dataset = load_your_dataset()

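time_old, time_new = 0.0, 0.0

with torch.no_grad():
    for i in range(100):
        x = dataset[i]
        x = x.cuda()
        y = x.cuda()

        # new infer (synchronize so the timer covers the GPU work itself)
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        detections = new_net(x)
        torch.cuda.synchronize()
        time_new += time.perf_counter() - start_time

        # old infer
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        detections = old_net(y)
        torch.cuda.synchronize()
        time_old += time.perf_counter() - start_time
print('old ', time_old)
print('new ', time_new)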

UPDATE
I found that torch has a sparse module that can reduce memory usage if enough parameters are pruned, but it doesn't support nn.Module yet, only Tensor objects. Here are some useful links:
https://github.com/pytorch/pytorch/issues/36214#issuecomment-619586452
https://pytorch.org/docs/stable/sparse.html

Sianna answered 11/6, 2020 at 14:27 Comment(0)

It is important to understand the difference between unstructured pruning and structured pruning.

Structured pruning: the dimensions of the weight tensors are reduced by removing entire rows/columns of the tensors. This translates into removing neurons with all their incoming and outgoing connections (in dense layers) or entire convolutional filters (in convolutional layers).
Unstructured pruning: individual weights can be "removed" (zeroed out) without constraints on the shape of the final tensor. This translates into removing individual connections between neurons (in dense layers) or removing individual weights of the convolutional filters (in convolutional layers). Notice that the resulting weight tensors can be sparse but maintain their original shape.
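To make the difference concrete, here is a minimal sketch on two small, hypothetical nn.Linear layers (the sizes and the "keep the 3 strongest neurons" rule are assumptions for illustration only). The unstructured case uses prune.l1_unstructured; the structured case is done by hand by slicing the weight tensor.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

# Unstructured: zero the 50% smallest-magnitude weights; the shape is unchanged.
dense = nn.Linear(8, 6)
prune.l1_unstructured(dense, name="weight", amount=0.5)
unstructured_shape = tuple(dense.weight.shape)        # still (6, 8)
unstructured_zeros = int((dense.weight == 0).sum())   # half of the 48 entries

# Structured (done by hand here): keep the 3 output neurons with the largest
# L1 norm and build a genuinely smaller layer from their rows.
full = nn.Linear(8, 6)
keep = full.weight.abs().sum(dim=1).topk(3).indices.sort().values
small = nn.Linear(8, 3)
with torch.no_grad():
    small.weight.copy_(full.weight[keep])
    small.bias.copy_(full.bias[keep])
structured_shape = tuple(small.weight.shape)          # (3, 8)

print(unstructured_shape, unstructured_zeros, structured_shape)
```

Only the second layer does less work per forward pass; the first one still multiplies a full 6x8 matrix, half of which happens to be zeros.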

Currently, torch.nn.utils.prune only applies masks: even its structured methods (such as prune.ln_structured) zero out entire channels without changing the shape of the weight tensor. This hardly helps to reduce the inference cost, because the dense kernels still process every element and GPUs are not optimized for sparse matrix multiplications. To reduce the number of floating-point operations you need to reduce the dimensions of your weight tensors, whereas masking produces weight tensors with many zeros but does not automatically reduce the size of such tensors.
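This is also why the saved model does not get smaller. A quick sketch, assuming a single hypothetical nn.Linear(100, 50) layer, that counts the elements stored in the state_dict before pruning, while the mask is attached, and after prune.remove:

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

layer = nn.Linear(100, 50)
n_before = sum(t.numel() for t in layer.state_dict().values())

# While pruned, the module stores the original weight AND a same-sized mask.
prune.l1_unstructured(layer, name="weight", amount=0.2)
keys_pruned = sorted(layer.state_dict().keys())
n_pruned = sum(t.numel() for t in layer.state_dict().values())

# prune.remove folds the mask into the weight; the tensor keeps its full shape.
prune.remove(layer, "weight")
keys_removed = sorted(layer.state_dict().keys())
n_removed = sum(t.numel() for t in layer.state_dict().values())

print(n_before, keys_pruned, n_pruned)
print(keys_removed, n_removed)
```

With the mask attached the state_dict is actually larger than before, and after prune.remove it is back to the original size, zeros included.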

Unstructured pruning can help improve the performance only when a lot of weights are removed. In this case, you can either rely on PyTorch sparse operations or try to find rows/columns that contain all zeros and thus can be removed.
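Both options can be sketched on a toy two-layer MLP. Everything here is an assumption for illustration: the layer sizes, and the fact that pruning happened to zero out output neurons 1 and 4 of the first layer completely (weights and bias).

```python
import torch
from torch import nn

torch.manual_seed(0)
fc1, fc2 = nn.Linear(8, 6), nn.Linear(6, 4)

# Pretend pruning zeroed every weight (and the bias) of output neurons 1 and 4.
with torch.no_grad():
    fc1.weight[[1, 4]] = 0.0
    fc1.bias[[1, 4]] = 0.0

x = torch.randn(5, 8)

# Option 1: keep the shape, but store the weight as a sparse COO tensor
# and use a sparse-dense matmul instead of the dense one.
w_sparse = fc1.weight.detach().to_sparse()
y_sparse = torch.sparse.mm(w_sparse, x.t()).t() + fc1.bias.detach()

# Option 2: find the rows that are entirely zero and drop them, together
# with the matching input columns of the next layer.
keep = ((fc1.weight != 0).any(dim=1) | (fc1.bias != 0)).nonzero().flatten()
small1, small2 = nn.Linear(8, len(keep)), nn.Linear(len(keep), 4)
with torch.no_grad():
    small1.weight.copy_(fc1.weight[keep])
    small1.bias.copy_(fc1.bias[keep])
    small2.weight.copy_(fc2.weight[:, keep])
    small2.bias.copy_(fc2.bias)
    y_full = fc2(torch.relu(fc1(x)))
    y_small = small2(torch.relu(small1(x)))

print(keep.tolist(), tuple(small1.weight.shape), tuple(small2.weight.shape))
```

Whether option 1 is actually faster depends on how sparse the weight is and on the hardware, so it is worth measuring; option 2 always shrinks the dense computation but only applies when whole rows or columns are zero.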

Instead, if you want to look into structured pruning, you can take a look at TorchPruner, a library that I have developed myself for research purposes and that provides utilities to find the least important neurons and slice the weight tensors accordingly.

Hangnail answered 9/8, 2020 at 12:5 Comment(0)

I am also trying pruning to increase inference speed, but what I found more useful is ONNX with ONNX Runtime. Here is a tutorial with all the steps:

https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html

It can reduce the inference time by up to 85% without accuracy loss.

Legislator answered 30/6, 2020 at 23:41 Comment(3)

How would it do that? Isn't that just an inference server? Are you talking about caching web requests and inferencing them over a batch? – Shayne 13/10, 2020 at 20:10

@AkshayRana I applied PyTorch Lightning's ModelPruning on a project of mine, and found the inference speed is identical (within 1 standard deviation) for models with 0, 35, and 50 percent sparsity. I've read that speed improvements from pruning should only be expected if you're able to zero out entire rows/columns of matrices – Lubet 10/6, 2021 at 14:45

@Legislator Could you clarify the model architecture and pruning approach that allowed you to achieve 85% speedup - did you use channel-wise or some other structured pruning? – Lubet 10/6, 2021 at 14:47

Pruning weights by setting them to 0.0 is just half the story. The other half is to remove them from your model (physically) so that they aren't involved in any computations. This isn't trivially possible with unstructured pruning, but with structured pruning, if you are able to remove specific output channels from your conv operations (and hence the input channels from the next conv), then you can get some speedup in your model.

To physically remove the channels, once you have identified which channels to remove, you need to adjust the learnable weight parameter on your Conv2d module. You may also need to change a few layers after it, specifically the batch norm layer and the next conv or linear layer.
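A rough sketch of that surgery for a conv -> batch norm -> conv chain. The helper drop_channels is hypothetical (not part of PyTorch), and it assumes plain convolutions with groups=1 and default dilation; the channel choice in the demo is arbitrary.

```python
import torch
from torch import nn


def drop_channels(conv, bn, next_conv, keep):
    """Return smaller copies of conv -> bn -> next_conv that keep only the
    output channels of `conv` listed in `keep` (a 1-D LongTensor)."""
    n = len(keep)
    new_conv = nn.Conv2d(conv.in_channels, n, conv.kernel_size, conv.stride,
                         conv.padding, bias=conv.bias is not None)
    new_bn = nn.BatchNorm2d(n)
    new_next = nn.Conv2d(n, next_conv.out_channels, next_conv.kernel_size,
                         next_conv.stride, next_conv.padding,
                         bias=next_conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep])          # slice output channels
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep])
        new_bn.weight.copy_(bn.weight[keep])              # per-channel BN params
        new_bn.bias.copy_(bn.bias[keep])
        new_bn.running_mean.copy_(bn.running_mean[keep])
        new_bn.running_var.copy_(bn.running_var[keep])
        new_next.weight.copy_(next_conv.weight[:, keep])  # slice input channels
        if next_conv.bias is not None:
            new_next.bias.copy_(next_conv.bias)
    return new_conv, new_bn, new_next


torch.manual_seed(0)
conv1 = nn.Conv2d(3, 8, 3, padding=1)
bn1 = nn.BatchNorm2d(8)
conv2 = nn.Conv2d(8, 4, 3, padding=1)

# Simulate structured pruning that zeroed output channels 1, 4 and 6 of conv1.
with torch.no_grad():
    conv1.weight[[1, 4, 6]] = 0.0
    conv1.bias[[1, 4, 6]] = 0.0

keep = torch.tensor([0, 2, 3, 5, 7])
s_conv1, s_bn1, s_conv2 = drop_channels(conv1, bn1, conv2, keep)

full = nn.Sequential(conv1, bn1, nn.ReLU(), conv2).eval()
small = nn.Sequential(s_conv1, s_bn1, nn.ReLU(), s_conv2).eval()

x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    out_full, out_small = full(x), small(x)

print(tuple(s_conv1.weight.shape), tuple(s_conv2.weight.shape))
```

In this demo the two outputs match because the batch norm layer is freshly initialised, so a zeroed channel stays zero after it. With a trained batch norm (non-zero shift or running mean) a zeroed channel can still produce a non-zero constant, so removing it changes the output slightly and usually calls for some fine-tuning.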

Michellemichels answered 20/7, 2023 at 4:37 Comment(0)
