I am interested in calculating similarity between vectors, however this similarity has to be a number between 0 and 1. There are many questions concerning tf-idf and cosine similarity, all indicating that the value lies between 0 and 1. From Wikipedia:
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (using tf–idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.
The peculiarity is that I wish to calculate the similarity between two vectors from two different word2vec models. These models have been aligned, though, so they should in fact represent their words in the same vector space. I can calculate the similarity between a word in model_a
and a word in model_b
like so
import gensim as gs
from sklearn.metrics.pairwise import cosine_similarity
model_a = gs.models.KeyedVectors.load_word2vec_format(model_a_path, binary=False)
model_b = gs.models.KeyedVectors.load_word2vec_format(model_b_path, binary=False)
vector_a = model_a[word_a].reshape(1, -1)
vector_b = model_b[word_b].reshape(1, -1)
sim = cosine_similarity(vector_a, vector_b).item(0)
But sim
is then a similarity metric in the [-1,1] range. Is there a scientifically sound way to map this to the [0,1] range? Intuitively I would think that something like
norm_sim = (sim + 1) / 2
is okay, but I'm not sure whether that is good practice with respect to the actual meaning of cosine similarity. If not, are other similarity metrics advised?
The reason why I am trying to get the values to be between 0 and 1 is because the data will be transferred to a colleague who will use it as a feature for her machine learning system, which expects all values to be between 0 and 1. Her intuition was to take the absolute value, but that seems to me to be a worse alternative because then you map opposites to be identical. Considering the actual meaning of cosine similarity, though, I might be wrong. So if taking the absolute value is the good approach, we can do that as well.
norm_sim
rescaling of -1.0 to 1.0 to 0.0 to 1.0 is fine, if your only purpose is to get 0.0-1.0 ranges... but of course the resulting value isn't a true cosine-similarity anymore. Does that matter? Unclear without knowing your other goals & reason for wanting a 0.0-1.0 score, but probably not. – Zachar