How to extract a feature vector from a single image in PyTorch?

I am trying to understand more about computer vision models and to explore how they work. To get a better feel for how to interpret feature vectors, I'm trying to use PyTorch to extract a feature vector from a single image. Below is the code I've pieced together from various places.

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image



img=Image.open("Documents/01235.png")

# Load the pretrained model
model = models.resnet18(pretrained=True)

# Use the model object to select the desired layer
layer = model._modules.get('avgpool')

# Set model to evaluation mode
model.eval()

transforms = torchvision.transforms.Compose([
        torchvision.transforms.Resize(256),
        torchvision.transforms.CenterCrop(224),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    
def get_vector(image_name):
    # Load the image with Pillow library
    img = Image.open("Documents/Documents/Driven Data Competitions/Hateful Memes Identification/data/01235.png")
    # Create a PyTorch Variable with the transformed image
    t_img = transforms(img)
    # Create a vector of zeros that will hold our feature vector
    # The 'avgpool' layer has an output size of 512
    my_embedding = torch.zeros(512)
    # Define a function that will copy the output of a layer
    def copy_data(m, i, o):
        my_embedding.copy_(o.data)
    # Attach that function to our selected layer
    h = layer.register_forward_hook(copy_data)
    # Run the model on our transformed image
    model(t_img)
    # Detach our copy function from the layer
    h.remove()
    # Return the feature vector
    return my_embedding

pic_vector = get_vector(img)

When I do this I get the following error:

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [3, 224, 224] instead

I'm sure this is an elementary error, but I can't seem to figure out how to fix it. It was my impression that the ToTensor transformation would make my data 4-d, but it seems it's either not working correctly or I'm misunderstanding it. I'd appreciate any help or resources I can use to learn more about this!

Flannelette asked 23/8, 2020 at 21:11

All the default nn.Modules in PyTorch expect an additional batch dimension. If the input to a module has shape (B, ...) then the output will be (B, ...) as well (though the later dimensions may change depending on the layer). This behavior allows efficient inference on batches of B inputs simultaneously. To make your code conform, you can simply unsqueeze an additional singleton dimension onto the front of the t_img tensor before sending it into your model, making it a (1, ...) tensor. You will also need to flatten the output of layer before storing it if you want to copy it into your one-dimensional my_embedding tensor.
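
As a quick illustration of the shape change (using a random stand-in tensor, not the actual image):

import torch

x = torch.randn(3, 224, 224)        # stand-in for a transformed (C, H, W) image
print(x.shape)                      # torch.Size([3, 224, 224])
print(x.unsqueeze(0).shape)         # torch.Size([1, 3, 224, 224]) -- a batch of one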

A couple of other things:

  • You should run inference inside a torch.no_grad() context to avoid computing gradients, since you won't be needing them. (Note that model.eval() just changes the behavior of certain layers, like dropout and batch normalization; it doesn't disable construction of the computation graph, whereas torch.no_grad() does.)

  • I assume this is just a copy-paste issue, but transforms is the name of an imported module as well as a global variable, so the assignment overrides the imported name.

  • o.data just returns o stripped of its autograd history (it is not actually a copy). In the old Variable interface (circa PyTorch 0.3.1 and earlier) this used to be necessary, but the Variable interface was deprecated back in PyTorch 0.4.0 and .data no longer does anything useful here; its use just creates confusion. Unfortunately, many tutorials are still being written using this old and unnecessary interface.

Updated code is then as follows:

import torch
import torchvision
import torchvision.models as models
from PIL import Image

img = Image.open("Documents/01235.png")

# Load the pretrained model
model = models.resnet18(pretrained=True)

# Use the model object to select the desired layer
layer = model._modules.get('avgpool')

# Set model to evaluation mode
model.eval()

transforms = torchvision.transforms.Compose([
    torchvision.transforms.Resize(256),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def get_vector(image):
    # Create a PyTorch tensor with the transformed image
    t_img = transforms(image)
    # Create a vector of zeros that will hold our feature vector
    # The 'avgpool' layer has an output size of 512
    my_embedding = torch.zeros(512)

    # Define a function that will copy the output of a layer
    def copy_data(m, i, o):
        my_embedding.copy_(o.flatten())                 # <-- flatten

    # Attach that function to our selected layer
    h = layer.register_forward_hook(copy_data)
    # Run the model on our transformed image
    with torch.no_grad():                               # <-- no_grad context
        model(t_img.unsqueeze(0))                       # <-- unsqueeze
    # Detach our copy function from the layer
    h.remove()
    # Return the feature vector
    return my_embedding


pic_vector = get_vector(img)
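
A quick sanity check on the result (just verifying the expected embedding shape):

print(pic_vector.shape)   # torch.Size([512])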
Kneedeep answered 23/8, 2020 at 21:43
Thanks so much for the great explanation. As a follow-up, I'm wondering how best to interpret the feature vector. The way I think about it is that every value in that vector is a representation of some piece of information about the picture. Is there any way to know approximately what each of the 512 values is describing from the documentation of resnet? I can't seem to find anything. As a corollary, if I ran this same procedure on two different images, does the 1st value in each resulting pic_vector correspond to the same feature? If this is too much for a follow-up I understand, thanks. – Flannelette
This gets a little abstract, but the short answer is "no". The feature is an abstract representation of the input image in a 512-dimensional space. The primary characteristic of the feature space is that if you compare the features from images of the same types of objects, they should be near one another, and different types of objects will be far from one another. This characteristic is a result of the training objective of the network. Also, by "nearby" we usually mean that the cosine similarity is close to 1, and by "far" we mean the cosine similarity is not close to 1 (see the sketch after these comments). – Kneedeep
The individual values in the feature aren't usually so meaningful. For example, you could apply any rigid transformation without translation to your feature space and still have the exact same cosine similarity between features, though this would completely change the individual values of your feature vectors. That said, there are ways to learn to distill specific information from feature vectors, but this usually entails learning transforms on the features which correlate to different attributes, which requires additional training. – Kneedeep
That's really helpful. Would it be accurate to say that the later the layer I extract the features from, the more specific the descriptions get? I'm roughly thinking of this as the difference between describing a picture as "contains people" versus "contains 3 people of x race and y hair color", even though I know that's not exactly what it means. I'm assuming this is roughly the case for higher-numbered resnet models as well? I really appreciate you taking the time to help me here. – Flannelette
Kind of. A different way to think of it is levels of abstraction. Layers closer to the input will likely represent lower-level concepts, like lines and edges; move up a layer or two and maybe it represents things like corners; go a bit further and you get features which represent building blocks like wheels and bricks and so forth. Finally, when you get to the end, the features represent the final objective, in this case object classes (e.g. "car", "cat", "person", etc.). – Kneedeep
This is more of an interpretation than a hard and fast rule. There's nothing in the objective function explicitly requiring that such representations be learned, but research and experimentation have shown this is a reasonable interpretation of what's going on. – Kneedeep
But you could also create a module that ends with a torch.Size([512]) output. That would be equivalent, right? – Contralto
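
A minimal sketch of the cosine-similarity comparison mentioned in the comments above (the vectors here are random stand-ins for two outputs of get_vector; the names are just for illustration):

import torch
import torch.nn.functional as F

# stand-ins for two 512-d feature vectors returned by get_vector
emb_a = torch.randn(512)
emb_b = torch.randn(512)

# cosine similarity: close to 1 for "nearby" features, lower for dissimilar ones
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
print(similarity.item())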

You can use create_feature_extractor from torchvision.models.feature_extraction to extract the required layer's features from the model.

The node name of the last hidden layer in ResNet18 is flatten, which is basically the flattened 1-D output of avgpool. You can extract whatever layers you want by adding them to the return_nodes dict below.

from torchvision.io import read_image
from torchvision.models import resnet18, ResNet18_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Step 1: Initialize the model with the best available weights
weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights)
model.eval()

# Step 2: Initialize the inference transforms
preprocess = weights.transforms()

# Step 3: Create the feature extractor with the required nodes
return_nodes = {'flatten': 'flatten'}
feature_extractor = create_feature_extractor(model, return_nodes=return_nodes)

# Step 4: Load the image(s) and apply inference preprocessing transforms
image = "?"
image = read_image(image).unsqueeze(0)
model_input = preprocess(image)

# Step 5: Extract the features
features = feature_extractor(model_input)
flatten_fts = features["flatten"].squeeze()
print(flatten_fts.shape)
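
If you are not sure which node names are available, torchvision's get_graph_node_names can list them (a quick check against the model defined above):

from torchvision.models.feature_extraction import get_graph_node_names

# lists the traceable node names for train and eval mode
train_nodes, eval_nodes = get_graph_node_names(model)
print(eval_nodes[-3:])   # the final nodes, e.g. ['avgpool', 'flatten', 'fc']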
Bubal answered 2/6, 2023 at 18:56

Instead of this:

model(t_img)

just do:

model(t_img[None])

This will add an extra dimension, so the image will have shape [1, 3, 224, 224] and it will work.
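
For reference, indexing with None adds the same leading dimension as unsqueeze(0); a quick check with a stand-in tensor:

import torch

t_img = torch.randn(3, 224, 224)                      # stand-in for the transformed image
print(t_img[None].shape)                              # torch.Size([1, 3, 224, 224])
print(torch.equal(t_img[None], t_img.unsqueeze(0)))   # True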

Steviestevy answered 22/8, 2021 at 8:50
