What is the difference between an Embedding Layer with a bias immediately afterwards and a Linear Layer in PyTorch

I am reading the "Deep Learning for Coders with fastai & PyTorch" book. I'm still a bit confused about what the Embedding module does. It seems like a short and simple network, except I can't seem to wrap my head around what Embedding does differently than Linear without a bias. I know it does some faster computational version of a dot product where one of the matrices is a one-hot encoded matrix and the other is the embedding matrix. Does it do this to, in effect, select a row of data? Please point out where I am wrong. Here is one of the simple networks shown in the book.

from fastai.collab import *  # fastai's Module and Embedding, as used in the book

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
Marentic answered 25/12, 2020 at 4:2

Embedding

[...] what Embedding does differently than Linear without a bias.

Essentially everything. torch.nn.Embedding is a lookup table; it works much like indexing into a torch.Tensor, but with a few twists (such as optional sparse gradients, or a padding_idx whose row serves as a fixed default).

For example:

import torch

embedding = torch.nn.Embedding(3, 4)

print(embedding.weight)

print(embedding(torch.tensor([1])))

Would output:

Parameter containing:
tensor([[ 0.1420, -0.1886,  0.6524,  0.3079],
        [ 0.2620,  0.4661,  0.7936, -1.6946],
        [ 0.0931,  0.3512,  0.3210, -0.5828]], requires_grad=True)
tensor([[ 0.2620,  0.4661,  0.7936, -1.6946]], grad_fn=<EmbeddingBackward>)

So we took the row at index 1 of the embedding's weight matrix (its second row). It does nothing more than that.
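
To make the lookup explicit, here is a small check (reusing the embedding created above, not part of the original output): indexing the weight matrix directly returns the same row as calling the layer.

# the lookup is nothing more than row indexing into the weight matrix
print(torch.equal(embedding(torch.tensor([1])), embedding.weight[1:2]))  # True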

Where is it used?

Usually when we want to learn some meaning for each row (as in word2vec, where semantically similar words end up close in Euclidean space) and possibly train those representations.

Linear

torch.nn.Linear (without bias) also stores a torch.Tensor (its weight), but it performs an operation on it (and on the input), which is essentially:

output = input.matmul(weight.t())

every time you call the layer (see source code and functional definition of this layer).
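
As a quick sanity check (a minimal sketch, not from the original answer), you can verify that a bias-free nn.Linear is exactly this matrix multiplication:

import torch

linear = torch.nn.Linear(4, 3, bias=False)
x = torch.randn(2, 4)

# nn.Linear without bias is just input @ weight.t()
print(torch.allclose(linear(x), x.matmul(linear.weight.t())))  # True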

Code snippet

The model in your code snippet does this:

  • creates two lookup tables in __init__
  • the layer is called with input of shape (batch_size, 2):
    • first column contains indices of user embeddings
    • second column contains indices of movie embeddings
  • these embeddings are multiplied element-wise and summed over the factor dimension, returning a tensor of shape (batch_size,) (a per-row dot product); this is different from nn.Linear, which would return (batch_size, out_features) via a full matrix multiplication rather than the element-wise multiplication followed by summation used here (a shape check is sketched below this list)

This is probably used to train both representations (of users and movies) for some recommender-like system.
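
Here is a minimal plain-PyTorch sketch of that forward pass (fastai's Module and Embedding behave like their torch.nn counterparts for this purpose; the sizes below are made up) so you can see the shapes:

import torch

n_users, n_movies, n_factors, batch_size = 10, 20, 5, 4

user_factors = torch.nn.Embedding(n_users, n_factors)
movie_factors = torch.nn.Embedding(n_movies, n_factors)

# each row of x holds (user index, movie index)
x = torch.stack([torch.randint(0, n_users, (batch_size,)),
                 torch.randint(0, n_movies, (batch_size,))], dim=1)

users = user_factors(x[:, 0])    # (batch_size, n_factors)
movies = movie_factors(x[:, 1])  # (batch_size, n_factors)
print((users * movies).sum(dim=1).shape)  # torch.Size([4]), i.e. (batch_size,)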

Other stuff

I know it does some faster computational version of a dot product where one of the matrices is a one-hot encoded matrix and the other is the embedding matrix.

No, it doesn't. torch.nn.Embedding performs no multiplication at all; it simply indexes into its weight matrix. That lookup is mathematically equivalent to multiplying a one-hot matrix by the embedding weights, but the one-hot matrix is never built. The layer can also use sparse gradients, and depending on the algorithm (and whether it supports sparsity) there might or might not be a performance boost.
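
If you want to see the equivalence the question alludes to (a small sketch, not part of the original answer), you can build the one-hot matrix yourself and compare it with the lookup; they give the same values, but the lookup never materializes the one-hot matrix:

import torch

embedding = torch.nn.Embedding(3, 4)
indices = torch.tensor([1, 0, 2])

one_hot = torch.nn.functional.one_hot(indices, num_classes=3).float()
via_matmul = one_hot @ embedding.weight   # equivalent but wasteful
via_lookup = embedding(indices)           # just row indexing

print(torch.allclose(via_matmul, via_lookup))  # True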

Czech answered 25/12, 2020 at 14:0

TL;DR

  • nn.Embedding is for categorical input.
  • nn.Linear is for ordinal input.

Explanation

You use nn.Embedding when dealing with categorical data, e.g. class labels (0, 1, 2, ...), because in a lookup table the output value is not proportional to the input key. This behavior suits categorical data, whose numeric value has nothing to do with its semantics.

On the other hand, nn.Linear, being a matrix multiplication, does not provide that behavior: the input and output are proportional to each other due to the nature of multiplication. Therefore, you use nn.Linear for ordinal data.
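
A small illustration of that contrast (a hypothetical sketch, not from the original answer): doubling the input of a bias-free nn.Linear doubles the output, while the embedding rows for neighbouring indices have no such relationship.

import torch

linear = torch.nn.Linear(1, 3, bias=False)
print(linear(torch.tensor([[1.0]])))   # some vector y
print(linear(torch.tensor([[2.0]])))   # exactly 2 * y

embedding = torch.nn.Embedding(3, 3)
print(embedding(torch.tensor([1])))    # row 1
print(embedding(torch.tensor([2])))    # row 2, unrelated to row 1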

Pause answered 20/3, 2023 at 12:45

What is the difference between an Embedding Layer with a bias immediately afterwards and a Linear Layer in PyTorch

There is no difference and we can prove this as follows:

Consider an (m, n) linear layer without bias. Such a layer is equivalent to the function y = Wx, where x is a column vector with dimensions (n, 1) and y is a column vector with dimensions (m, 1).

You can easily show that the result takes the form y[i] = dot(W[i,:], x), which means that if x is a one-hot vector with a one at the kth index, then y[i] = W[i,k], which implies y = W[:,k]. In other words, multiplying by a one-hot vector with a one at the kth index extracts the kth column of the weight matrix (or the kth row for y = x^T W).

This implies there are (at least) two ways to implement an embedding layer:

  1. Assume x is a one-hot vector, perform the matrix multiplication (i.e., use a linear layer without bias)
  2. Assume x is the index (k) where the one-hot vector is 1 and directly return the kth column/row.

You learn the same weight matrix whether you implement it via (1) or (2), but (2) skips the matrix multiplication entirely and is therefore more efficient; a quick check is sketched below.
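
A quick numerical check of this equivalence (a sketch using torch.nn layers; the sizes are arbitrary): feed a one-hot vector to a bias-free linear layer, then do the same lookup with an embedding that shares the linear layer's weights.

import torch

m, n, k = 4, 6, 2   # output dim, vocabulary size, chosen index

linear = torch.nn.Linear(n, m, bias=False)   # weight W has shape (m, n)

# (1) one-hot vector times W extracts column k of W
x = torch.zeros(1, n)
x[0, k] = 1.0
via_linear = linear(x)                       # shape (1, m)

# (2) direct lookup in an embedding whose rows are the columns of W
embedding = torch.nn.Embedding(n, m)
with torch.no_grad():
    embedding.weight.copy_(linear.weight.t())
via_lookup = embedding(torch.tensor([k]))    # shape (1, m)

print(torch.allclose(via_linear, via_lookup))  # True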

Loculus answered 10/3 at 16:55
