Postgres: index on cosine similarity of float arrays for one-to-many search

Cosine similarity between two equally-sized vectors (of reals) is defined as the dot product divided by the product of the norms.

To represent vectors, I have a large table of float arrays, e.g. CREATE TABLE foo(vec float[]). Given a certain float array, I need to quickly (with an index, not a seqscan) find the closest arrays in that table by cosine similarity, e.g. SELECT * FROM foo ORDER BY cos_sim(vec, ARRAY[1.0, 4.5, 2.2]) DESC LIMIT 10; But what do I use?
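For reference, cos_sim above isn't a built-in; a minimal sketch of such a function in plain SQL (the name and signature are just what my query assumes) could look like this:

    -- Sketch only: cosine similarity of two float arrays, element by element.
    -- ORDER BY on this still forces a sequential scan -- hence the question.
    CREATE OR REPLACE FUNCTION cos_sim(a float[], b float[]) RETURNS float AS $$
      SELECT sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
      FROM unnest(a, b) AS t(x, y);  -- multi-argument unnest pairs up the elements
    $$ LANGUAGE sql IMMUTABLE STRICT;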

pg_trgm's cosine similarity support is different: it compares text, and I'm not sure exactly what it does. An extension called smlar (here) also has cosine similarity support for float arrays, but again it does something different. What I described is commonly used in data analysis to compare features of documents, so I was expecting Postgres to have support for it.

Maurinemaurise answered 28/6, 2017 at 1:13 Comment(13)
Can you clarify what you mean by "with an index"? Cosine similarity is a binary operation which, in the structure you describe, will operate on pairs of rows. Indexes don't take pairs of rows.Grandiloquence
@Grandiloquence < is also a binary operator, but there is btree index support in Postgres to speed up queries with filtering and ordering. <@ is a binary operator for arrays -- and again, there is GiST and GIN support to speed up queries.Dc
@Maurinemaurise can you explain "pg_trgm's cosine similarity support is different"? There are examples in the PDF from Bartunov/Sigaev with ORDER BY. The task of finding "most similar / closest objects" is called kNN search -- they implemented it for Euclidean distance in GiST a few years before that PDF was published -- see pgcon.org/2010/schedule/attachments/168_pgcon-2010-1.pdf, so I assume they considered the same goals (not only "find all objects within some range", but kNN too) for cosine similarity later.Dc
@Dc -- Is it possible, do you think, to create a GiST or GIN index of the cosine similarity of all pairs of rows in a table?Grandiloquence
It is definitely possible; this is what the "smlar" extension was created for. The only open question to me is whether it allows index scans for kNN search (i.e. does it support ORDER BY .. LIMIT ..) or whether such an index supports only range lookups (like "get all records where similarity < some value").Dc
Those Russian guys don't create extensions w/o indexing support :) They are authors of full text search (with GiST, GIN indexes), arrays indexing capabilities (GIN and RD-tree via GiST), the GiST version of R-tree (which replaced the original R-tree implementation), hstore, indexing support for jsonb and many more. I would be very surprised if "smlar" was implemented w/o indexing support.Dc
@Dc smlar's cosine similarity metric is defined as N_i / sqrt(N_a * N_b), where N_i is the number of unique elements in the intersection and N_a and N_b are the numbers of unique elements in each vector. (1.1, 2.0) and (1.0, 2.1) have 0 similarity by their metric. They explain how the index works in their paper, and it seems this is an approximation aimed at certain use cases that makes the indexing possible (unless there's some more complicated way). pg_trgm doesn't say what it uses, but it seems to be similar. It also only takes text arguments, so it can't be what I'm looking for.Maurinemaurise
By the way, searching online, there seem to be two different meanings of "cosine similarity." Sometimes I get the set distance metric that smlar and (I think) pg_trgm use, and sometimes I get the angle distance that Wikipedia describes.Maurinemaurise
sigaev.ru/git/gitweb.cgi?p=smlar.git;a=tree;hb=HEAD tells us that there should be both GIN and GiST support in the smlar extension (see files smlar_gin.c and smlar_gist.c)Dc
Usually, distance is defined as 1/similarity. So the greater the similarity -- i.e. the 'more similar' the objects are -- the smaller the distance between them.Dc
Oh. "N_i is the number of unique elements in the intersection and N_a and N_b are number of unique elements in each vector" looks more like Jaccard similarity (and not exactly, some weird variant), definitely not like cosine similarity... Where did you take that definition? Source code or docs? Can you provide a link?Dc
@Dc It's in the paper they wrote (sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf) and the source code (lines 670 and 806 github.com/jirutka/smlar/blob/master/smlar.c). Just in case, I also tried a few examples in SQL. Re GIN/GiST: Yeah, it supports both.Maurinemaurise
FWIW, this post has a postgresql implementation of cosine similarity.Numismatist

If you're ok with an inexact solution, you can use random projection: https://en.wikipedia.org/wiki/Random_projection.

Randomly generate k different vectors of the same length as your other vectors and store them somewhere. You will use these to spatially bin your data. For each vector in your table, do a dot product with each of the random vectors and store the product's sign.

Vectors with the same sign for each random vector go in the same bin, and generally vectors that have high cosine similarity will end up in the same bin. You can pack the signs as bits into an integer and use a normal index to pull vectors in the same bin as your query, then do a sequential search to find those with the highest cosine similarity.
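A rough sketch of how this could look in plain SQL (everything here is hypothetical: rp_vectors, rp_bucket, and foo.bucket are made-up names, k = 8 is arbitrary, and cos_sim is whatever similarity function you already use for the final re-ranking):

    -- Sketch only. Store k = 8 random vectors with the same dimensionality as the data.
    CREATE TABLE rp_vectors (id int PRIMARY KEY, vec float[]);

    -- Hypothetical helper: take the sign of the dot product of v with each random
    -- vector and pack those signs into one integer (the bin key).
    CREATE OR REPLACE FUNCTION rp_bucket(v float[]) RETURNS int AS $$
      SELECT coalesce(bit_or(CASE WHEN dot > 0 THEN 1 << (id - 1) ELSE 0 END), 0)
      FROM (
        SELECT r.id, sum(x * y) AS dot
        FROM rp_vectors r, unnest(r.vec, v) AS t(x, y)
        GROUP BY r.id
      ) dots;
    $$ LANGUAGE sql STABLE;

    -- Bin every row once, index the bin, then search only within the query's bin.
    ALTER TABLE foo ADD COLUMN bucket int;
    UPDATE foo SET bucket = rp_bucket(vec);
    CREATE INDEX ON foo (bucket);

    SELECT * FROM foo
    WHERE bucket = rp_bucket(ARRAY[1.0, 4.5, 2.2])
    ORDER BY cos_sim(vec, ARRAY[1.0, 4.5, 2.2]) DESC
    LIMIT 10;

Fewer random vectors give bigger bins (more rows to re-rank with the exact similarity); more vectors give smaller bins but a higher chance that truly similar vectors land in different bins.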

Pout answered 30/10, 2018 at 17:34 Comment(5)
I've heard of dimensionality reduction but haven't seen this kind. This is cool because you could even do it with just Postgres tables/queries if you didn't want to write an extension. But it wouldn't have worked for my case because my data was already "compressed" down from higher dimensionality (it's natural language) using SVD or GloVe. I was finding that 300d gave more accurate results than 100d. So really it was the full 300 dimensions that I wanted.Maurinemaurise
Random projection can be used as dimensionality reduction, but in this case we're leaving the vectors as they are and using the random projection for binning. The random projection is being used as a key into a spatial hash table instead of a lower-dimensional representation. So your vectors will stay at the full 300 dimensions, but it will be easier to pull a group of similar vectors on an index. An analogy in 2D would be using the x and y axis as your "random" vectors and grouping your vectors by the quadrant they're in.Pout
This does look similar to a recent scientific paper that uses a random binary forest to divide the vector space into non-overlapping cells of convex hyper-polyhedrons bounded by hyperplanes. I will definitely try both of them out, thank you!Propend
@HenryConklin Oh right, I confused myself. That makes sense, and I agree this should work.Maurinemaurise
Years later, I was poking through my SO stuff and decided this was the more correct answer, though mine works without having to write a custom extension. Also felt bad giving myself the checkmark given how much great help you all provided here.Maurinemaurise

I gather that no extension does this, so I've found a limited workaround:

If A and B are both normalized (length 1), cos(A, B) = 1 - 0.5 * ||A - B||^2, where ||A - B|| is the Euclidean distance and cos(A, B) is the cosine similarity. So greater Euclidean distance <=> lesser cosine similarity (which makes sense intuitively if you imagine a unit circle), and if you have non-normalized vectors, changing their magnitudes without changing their directions doesn't affect their cosine similarities. Great, so I can normalize my vectors and compare their Euclidean distances...
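The identity is just the expansion of the squared norm under the assumption ||A|| = ||B|| = 1:

    ||A - B||^2 = ||A||^2 - 2 (A . B) + ||B||^2 = 2 - 2 cos(A, B)
    =>  cos(A, B) = 1 - 0.5 * ||A - B||^2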

There's a nice answer here about Cube, which supports n-dimensional points and GiST indexes on Euclidean distance, but it only supports 100 or fewer dimensions (can be hacked higher, but I had issues around 135 and higher, so now I'm afraid). Also requires Postgres 9.6 or later.

So:

  1. Make sure I'm OK with at most 100 dimensions. Upgrade to Postgres 9.6 or later.
  2. Fill my table with arrays to represent vectors.
  3. Normalize the vectors to create an extra column of cube points. Create a GiST index on this column.
  4. Order by Euclidean distance ascending to get cosine similarity descending (full sketch after this list): EXPLAIN SELECT * FROM mytable ORDER BY normalized <-> cube(array[1,2,3,4,5,6,7,8,9,0]) LIMIT 10;
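Put together, steps 2-4 look roughly like this (table and column names are placeholders, and the normalization UPDATE is only a sketch):

    -- Needs Postgres 9.6+ for KNN-GiST ordering on cube's <-> operator.
    CREATE EXTENSION IF NOT EXISTS cube;

    ALTER TABLE mytable ADD COLUMN normalized cube;

    -- Normalize each float[] vector to length 1 and store it as a cube point.
    UPDATE mytable
    SET normalized = cube(ARRAY(
      SELECT x / (SELECT sqrt(sum(y * y)) FROM unnest(vec) AS u(y))
      FROM unnest(vec) WITH ORDINALITY AS t(x, i)
      ORDER BY i
    ));

    CREATE INDEX ON mytable USING gist (normalized);

    -- With unit-length stored points, smaller Euclidean distance to the query
    -- means greater cosine similarity, so this ordering holds even if the
    -- query vector itself isn't normalized.
    SELECT * FROM mytable
    ORDER BY normalized <-> cube(ARRAY[1,2,3,4,5,6,7,8,9,0])
    LIMIT 10;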

If I need more than 100 dimensions, I might be able to achieve this using multiple indexed columns. Will update the answer in that case.

Update: Pretty sure there's nothing I can do with splitting the >100-dimension vector into multiple columns. I end up having to scan the entire table.

Maurinemaurise answered 28/6, 2017 at 18:15 Comment(4)
Is there any progress? The Euclidean distance relation nicely reduces this to the closest-pair-of-points problem, but the indexing problem seems quite difficult. Perhaps locality-sensitive hashing may work?Propend
@Propend Hi! Well, that extension does the indexing, and they have a paper on the hashing mechanism, except Postgres has that arbitrary limitation on an index's per-row size. I can go to around 180 dimensions if I edit the extension to use float4 instead of float8 (aka double). This project was abandoned for unrelated reasons, so I haven't poked at the indexing issue since. Also, I suppose Postgres isn't what people use for latency-sensitive machine learning model inference ;)Maurinemaurise
Thank you for the response! I use Postgres as well and want to do cosine similarity search over 512-dimensional vectors. Unfortunately, I couldn't find anything efficient (other than a full scan) -- this question was the only hope. So I believe I have to write my own search function.Propend
Henry Conklin's answer might be suitable for your needs.Maurinemaurise