Cosine similarity between two equally-sized vectors (of reals) is defined as the dot product divided by the product of the norms.
To represent vectors, I have a large table of float
arrays, e.g. CREATE TABLE foo(vec float[])'
. Given a certain float
array, I need to quickly (with an index, not a seqscan) find the closest arrays in that table by cosine similarity, e.g. SELECT * FROM foo ORDER BY cos_sim(vec, ARRAY[1.0, 4.5, 2.2]) DESC LIMIT 10;
But what do I use?
pg_trgm
's cosine similarity support is different. It compares text, and I'm not sure what it does exactly. An extension called smlar
(here) also has cosine similarity support for float arrays but again is doing something different. What I described is commonly used in data analysis to compare features of documents, so I was thinking there'd be support in Postgres for it.
<
is also a binary operator, but there is btree index support in Postgres to speed up queries with filtering and ordering. <@ is a binary operator for arrays -- and again, there is GiST and GIN support to speed up queries. – Dcsmlar
's cosine similarity metric is defined as N_i / sqrt(N_a * N_b), where N_i is the number of unique elements in the intersection and N_a and N_b are number of unique elements in each vector. (1.1, 2.0) and (1.0, 2.1) have 0 similarity by their metric. They explain how the index works in their paper, and it seems this an approximation aimed for certain use cases that makes the indexing possible (unless there's some more complicated way).pg_trgm
doesn't say what it uses, but it seems to be similar. It also only takes text arguments, so it can't be what I'm looking for. – Maurinemaurisesmlar
and (I think)pg_trgm
use, and sometimes I get the angle distance that Wikipedia describes. – Maurinemaurisesmlar
extension (see files smlar_gin.c and smlar_gist.c) – Dc