Embedding
[...] what Embedding does differently than Linear without a bias.
Essentially everything. torch.nn.Embedding
is a lookup table; it works like indexing into a torch.Tensor,
but with a few twists (like the option to use sparse gradients, or a padding_idx that keeps a default value at a specified index).
For example:
import torch
embedding = torch.nn.Embedding(3, 4)
print(embedding.weight)
print(embedding(torch.tensor([1])))
Would output:
Parameter containing:
tensor([[ 0.1420, -0.1886, 0.6524, 0.3079],
[ 0.2620, 0.4661, 0.7936, -1.6946],
[ 0.0931, 0.3512, 0.3210, -0.5828]], requires_grad=True)
tensor([[ 0.2620, 0.4661, 0.7936, -1.6946]], grad_fn=<EmbeddingBackward>)
So we took the row at index 1 (the second row) of the embedding weight. It does nothing more than that.
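You can confirm the call really is just indexing into the weight matrix (a minimal check):
import torch

embedding = torch.nn.Embedding(3, 4)
index = torch.tensor([1])
# the forward pass returns exactly the rows of embedding.weight selected by the indices
assert torch.equal(embedding(index), embedding.weight[index])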
Where is it used?
Usually when we want to encode some meaning for each row (as in word2vec, where semantically similar words are close in Euclidean space) and possibly train those representations.
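For example, a tokenized sentence can be mapped to a sequence of trainable vectors in a single call (a minimal sketch; the vocabulary size, dimension and indices below are made up):
import torch

vocab_size, embedding_dim = 10_000, 300       # made-up sizes
word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

token_indices = torch.tensor([3, 41, 7, 3])   # e.g. indices of a tokenized sentence
vectors = word_embeddings(token_indices)
print(vectors.shape)                          # torch.Size([4, 300])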
Linear
torch.nn.Linear
(without bias) also holds a torch.Tensor
(its weight), but it performs an operation on it (and on the input) which is essentially:
output = input.matmul(weight.t())
every time you call the layer (see the source code and the functional definition of this layer).
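You can check this equivalence directly (a minimal check):
import torch

linear = torch.nn.Linear(4, 3, bias=False)
x = torch.randn(2, 4)
# without bias, the layer is just a matrix multiplication with the transposed weight
assert torch.allclose(linear(x), x.matmul(linear.weight.t()))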
Code snippet
The layer in your code snippet does this:
- creates two lookup tables in __init__
- the layer is called with input of shape (batch_size, 2):
  - the first column contains indices of user embeddings
  - the second column contains indices of movie embeddings
- these embeddings are multiplied element-wise and summed, returning a tensor of shape (batch_size,) (so it differs from nn.Linear, which would return (batch_size, out_features) by multiplying the input with its weight matrix, rather than taking the element-wise product of two embeddings and summing it as done here)
This is probably used to train both representations (of users and movies) for some recommender-like system.
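Since the snippet itself isn't reproduced here, a rough sketch of such a layer could look like this (the class and argument names are hypothetical):
import torch

class MatrixFactorization(torch.nn.Module):       # hypothetical name
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.user_factors = torch.nn.Embedding(n_users, n_factors)
        self.movie_factors = torch.nn.Embedding(n_movies, n_factors)

    def forward(self, x):
        # x has shape (batch_size, 2): user indices in column 0, movie indices in column 1
        users = self.user_factors(x[:, 0])         # (batch_size, n_factors)
        movies = self.movie_factors(x[:, 1])       # (batch_size, n_factors)
        return (users * movies).sum(dim=1)         # (batch_size,)

model = MatrixFactorization(n_users=100, n_movies=50, n_factors=8)
batch = torch.tensor([[0, 3], [7, 1]])             # two (user, movie) pairs
print(model(batch).shape)                          # torch.Size([2])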
Other stuff
I know it does some faster computational version of a dot product
where one of the matrices is a one-hot encoded matrix and the other is
the embedding matrix.
No, it doesn't. torch.nn.Embedding
just performs a lookup; its input is equivalent to a one-hot encoding and its gradient can be sparse (sparse=True), but whether that yields a performance boost depends on the algorithms involved (and on whether they support sparsity).
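The lookup does produce the same values as multiplying a one-hot matrix by the weight, it just skips the multiplication, which you can verify (a minimal check):
import torch
import torch.nn.functional as F

embedding = torch.nn.Embedding(3, 4)
indices = torch.tensor([1, 0, 2])

one_hot = F.one_hot(indices, num_classes=3).float()   # shape (3, 3)
# same result, but the lookup never materializes the one-hot matrix
assert torch.allclose(one_hot @ embedding.weight, embedding(indices))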