How does fastText create sentence vectors from word embeddings?

I wanted to understand how fastText creates vectors for sentences. According to issue 309, sentence vectors are obtained by averaging the vectors of the words.

In order to confirm this, I wrote the following script:

import numpy as np
import fastText as ft

# Loading model for Finnish.
model = ft.load_model('cc.fi.300.bin')

# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')

# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (one + two) / 2

# Checking if the two approaches yield the same result.
is_equal = np.array_equal(one_two, one_two_avg)

# Printing the result.
print(is_equal)

# Result: False

But the obtained vectors are not equal.

Why aren't the two vectors the same? Could it be related to the way I am averaging the vectors, or is there something I am missing?

Shrike answered 14/1, 2019 at 12:1 Comment(0)

First, you missed that get_sentence_vector is not just a simple average. Before fastText sums the word vectors, each vector is divided by its L2 norm, and only vectors with a positive L2 norm are included in the average.

Second, a sentence always ends with an EOS token, so if you calculate the average manually you need to include the EOS vector as well.

Try this (I assume the L2 norm of each word vector is positive):


import numpy as np

def l2_norm(x):
    # Euclidean (L2) norm of a vector.
    return np.sqrt(np.sum(x**2))

def div_norm(x):
    # Normalize to a unit vector; leave zero vectors unchanged.
    norm_value = l2_norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    else:
        return x

# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
eos = model.get_word_vector('\n')

# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (div_norm(one) + div_norm(two) + div_norm(eos)) / 3
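
You can then compare the two with a tolerance instead of exact equality, since fastText works in float32 and tiny rounding differences are expected. (A minimal check; whether it prints True also depends on whether your fastText version really includes the EOS vector, which the comments below question.)

is_close = np.allclose(one_two, one_two_avg, atol=1e-6)
print(is_close)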

You can see the source code here or you can see the discussion here.

Maretz answered 24/5, 2019 at 9:16 Comment(4)
It's more or less an average, but an average of unit vectors. – Manriquez
Please note that the L2 norm can't be negative: it is 0 or a positive number. If the L2 norm is 0, it makes no sense to divide by it. – Car
From your link, we only normalize the vectors if args_->model != model_name::sup? I suppose that means we don't do it in the supervised case. – Fishery
@Maretz Can you please explain why we need to include the vector for \n? Does it tamper with the definition of a sentence and/or a multi-word phrase? – James

Even though this is an old question, fastText remains a good starting point for understanding sentence vectors generated by averaging individual word vectors: it makes the simplicity, advantages, and shortcomings easy to explore before trying alternatives such as SIF, SentenceBERT, or (with an API key, if you have one) the OpenAI embeddings. I would like to point out that the use of EOS mentioned by @Maretz in one of the answers is not correct. This can be checked with the code below:

import numpy as np
import fasttext
import fasttext.util

# Download and load the pre-trained English model.
fasttext.util.download_model('en', if_exists='ignore')
ft_en_model = fasttext.load_model('cc.en.300.bin')

def normalize_vector(vec):
    # Normalize to a unit vector; leave zero vectors unchanged.
    norm = np.sqrt(np.sum(vec**2))
    if norm != 0:
        return vec / norm
    else:
        return vec

vec1 = normalize_vector(ft_en_model.get_word_vector('Paris'))
vec2 = normalize_vector(ft_en_model.get_word_vector('is'))
vec3 = normalize_vector(ft_en_model.get_word_vector('the'))
vec4 = normalize_vector(ft_en_model.get_word_vector('capital'))
vec5 = normalize_vector(ft_en_model.get_word_vector('of'))
vec6 = normalize_vector(ft_en_model.get_word_vector('France'))

sent_vec = (vec1+vec2+vec3+vec4+vec5+vec6)/6.0
print(sent_vec[0:10])

vec_s1 = ft_en_model.get_sentence_vector('Paris is the capital of France')
print(vec_s1[0:10])

The output in both cases is:

[-0.00648477 -0.01590857 -0.02449585 -0.00863768 -0.00655541  0.00647134
  0.01945119 -0.00058179 -0.03748131  0.01811352]
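
To make the comparison explicit rather than comparing printed slices by eye, the full vectors can also be checked with a tolerance (a small sketch, reusing sent_vec and vec_s1 from above):

print(np.allclose(sent_vec, vec_s1, atol=1e-6))  # prints True if the two computations agree, as the identical slices suggest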
Mccoy answered 5/5, 2023 at 12:38 Comment(0)

You might be hitting an issue with floating-point math; for example, if one addition was done on a CPU and another on a GPU, the results could differ.

The best way to check whether it is doing what you want is to verify that the vectors are almost equal, rather than exactly equal.

You might want to print out the two vectors and inspect them manually, or compute the dot product of one_two minus one_two_avg with itself (i.e., the squared length of the difference between the two).
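
For example (a minimal sketch with numpy, reusing one_two and one_two_avg from the question):

import numpy as np

diff = one_two - one_two_avg
print(np.dot(diff, diff))                 # squared length of the difference
print(np.allclose(one_two, one_two_avg))  # True if they agree within tolerance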

Stunt answered 14/1, 2019 at 16:57 Comment(0)
