Cosine similarity between 0 and 1

I am interested in calculating the similarity between vectors; however, this similarity has to be a number between 0 and 1. There are many questions concerning tf-idf and cosine similarity, all indicating that the value lies between 0 and 1. From Wikipedia:

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (using tf–idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

The peculiarity is that I wish to calculate the similarity between two vectors from two different word2vec models. These models have been aligned, though, so they should in fact represent their words in the same vector space. I can calculate the similarity between a word in model_a and a word in model_b like so

import gensim as gs
from sklearn.metrics.pairwise import cosine_similarity

# Load the two (aligned) word2vec models from text format
model_a = gs.models.KeyedVectors.load_word2vec_format(model_a_path, binary=False)
model_b = gs.models.KeyedVectors.load_word2vec_format(model_b_path, binary=False)

# Look up each word's vector and reshape to (1, n_dims), since sklearn expects 2-D input
vector_a = model_a[word_a].reshape(1, -1)
vector_b = model_b[word_b].reshape(1, -1)

# cosine_similarity returns a (1, 1) array; .item(0) extracts the scalar
sim = cosine_similarity(vector_a, vector_b).item(0)

But sim is then a similarity metric in the [-1,1] range. Is there a scientifically sound way to map this to the [0,1] range? Intuitively I would think that something like

norm_sim = (sim + 1) / 2

is okay, but I'm not sure whether that is good practice with respect to the actual meaning of cosine similarity. If not, are other similarity metrics advised?

The reason why I am trying to get the values to be between 0 and 1 is because the data will be transferred to a colleague who will use it as a feature for her machine learning system, which expects all values to be between 0 and 1. Her intuition was to take the absolute value, but that seems to me to be a worse alternative because then you map opposites to be identical. Considering the actual meaning of cosine similarity, though, I might be wrong. So if taking the absolute value is the good approach, we can do that as well.

Sensory asked 26/5, 2019 at 19:53 Comment(7)
What they're talking about is just a vector dot product with normalized (unit) vector lengths. That's just the cosine of the angle between the two vectors (again, if you scale the lengths to one). That's where the 90 degree reference comes in, since above 90 degrees it would become negative. – Vulgarity
Why do you need the value to be in the 0 to 1 range? ("Dense" embeddings like word2vec have vectors in every direction from the origin, hence cosine-similarities can be negative. Plain TF-IDF, on bag-of-words word counts, is where the results will only be 0 to 1.) Your norm_sim rescaling of -1.0 to 1.0 to 0.0 to 1.0 is fine, if your only purpose is to get 0.0-1.0 ranges... but of course the resulting value isn't a true cosine-similarity anymore. Does that matter? Unclear without knowing your other goals & reason for wanting a 0.0-1.0 score, but probably not. – Zachar
Thanks for your interest @gojomo. I have added a final paragraph to explain why I need this value in that range. If there is a way to force word2vec to only produce positive vectors, that would be cool though - even though I'm not sure how that could possibly work considering the semantics of word2vec. – Sensory
OK, that's a fair reason to prefer 0.0-1.0 (though many learning algorithms should do just fine with a -1.0 to 1.0 range). It won't necessarily matter that the values aren't real full-range angles any more. (If the algorithm needed real angles, it'd work with -1.0 to 1.0.) Using the absolute value would be a bad idea, as it would change the rank order of similarities – moving some results that are "natively" most-dissimilar way up. – Zachar
There's been work on constraining word-vectors to have only non-negative values in dimensions, & the usual benefit is that the resulting dimensions are more likely to be individually interpretable. (See for example cs.cmu.edu/~bmurphy/NNSE/.) However, gensim doesn't support this variant, & only trying it could reveal whether it would be better for any particular project. – Zachar
Also, there's other research that suggests usual word-vectors may not be 'balanced' around the origin (so you'll see fewer negative cosine-similarities than would be expected from points in a random hypersphere), and that shifting them to be more balanced will usually improve them for other tasks. See: arxiv.org/abs/1702.01417v2 – Zachar
@Zachar Great. If you could copy that into an answer, I can accept it. – Sensory

You have a fair reason to prefer 0.0-1.0 (though many learning algorithms should do just fine with a -1.0 to 1.0 range). Your norm_sim rescaling of -1.0 to 1.0 to 0.0 to 1.0 is fine, if your only purpose is to get 0.0-1.0 ranges... but of course the resulting value isn't a true cosine-similarity anymore.

It won't necessarily matter that the values aren't real full-range angles any more. (If the algorithm needed real angles, it'd work with -1.0 to 1.0.)

Using the signless absolute value would be a bad idea, as it would change the rank order of similarities – moving some results that are "natively" most-dissimilar way up.
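
As a quick illustration with made-up similarity values, the rescaling preserves the ranking while the absolute value does not:

# Hypothetical cosine similarities: strongly opposite, near-orthogonal, fairly similar
raw_sims = [-0.9, 0.1, 0.5]

rescaled = [(s + 1) / 2 for s in raw_sims]  # [0.05, 0.55, 0.75] - ranking preserved
absolute = [abs(s) for s in raw_sims]       # [0.9, 0.1, 0.5]   - the most-dissimilar pair now ranks highest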

There's been work on constraining word-vectors to have only non-negative values in dimensions, & the usual benefit is that the resulting dimensions are more likely to be individually interpretable. (See for example https://cs.cmu.edu/~bmurphy/NNSE/.) However, gensim doesn't support this variant, & only trying it could reveal whether it would be better for any particular project.

Also, there's other research that suggests usual word-vectors may not be 'balanced' around the origin (so you'll see fewer negative cosine-similarities than would be expected from points in a random hypersphere), and that shifting them to be more balanced will usually improve them for other tasks. See: https://arxiv.org/abs/1702.01417v2
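
For readers who want to experiment with that re-balancing, a rough sketch of the idea (subtract the mean vector, then project out the top principal directions) might look like the following; vecs is assumed to be an (n_words, n_dims) NumPy array, e.g. model_a.vectors from a gensim KeyedVectors object:

import numpy as np

def center_vectors(vecs, n_components=2):
    """Subtract the mean vector, then remove the top principal components."""
    centered = vecs - vecs.mean(axis=0)
    # Top principal directions via SVD of the mean-centered matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                    # (n_components, n_dims)
    # Project out the dominant shared directions
    return centered - centered @ top.T @ top

# balanced = center_vectors(model_a.vectors)  # hypothetical usage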

Zachar answered 5/6, 2019 at 16:43 Comment(2)
Hi @Zachar, would you explain why non-negative values make the embedding interpretable? How do you still make sense of each of the embedding dimensions? – Continuous
You'd have to consult the paper for any reasoning that supports their approach. – Zachar

Just an update to @gojomo's answer: I think you need interpretable word embeddings whose dimensions contain only non-negative values (as opposed to the original word2vec model proposed by Mikolov et al.). With such embeddings, the cosine similarity between words falls in the 0-1 range, as desired.

This paper is a good kickoff for this problem: https://www.aclweb.org/anthology/D15-1196
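
gensim doesn't produce such embeddings, but as a rough, hypothetical sketch of the idea: factorizing a non-negative word-context co-occurrence matrix with a non-negative method (here sklearn's NMF, on random stand-in counts) yields non-negative word vectors, and cosine similarity between any two non-negative vectors is guaranteed to lie in [0, 1]:

import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for a real word-context co-occurrence matrix (n_words x n_contexts)
rng = np.random.default_rng(0)
cooc = rng.poisson(1.0, size=(1000, 500)).astype(float)

# Non-negative factorization gives non-negative word vectors
nmf = NMF(n_components=50, init="nndsvd", max_iter=200)
word_vecs = nmf.fit_transform(cooc)

# Because the vectors have no negative components, cosine similarity is in [0, 1]
sim = cosine_similarity(word_vecs[0:1], word_vecs[1:2]).item()
assert 0.0 <= sim <= 1.0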

Omnivore answered 22/8, 2019 at 8:43 Comment(0)
