Indexing and querying high-dimensional data in PostgreSQL
I want to index data in high dimensions (128-dimensional vectors of integers, each in the range [0, 254]):

| id |      vector       |
|  1 | { 1, 0, ..., 254} |
|  2 | { 2, 128, ...,1}  |
|  . | { 1, 0, ..., 252} |
|  n | { 1, 2, ..., 251} |

I saw that PostGIS implements R-trees. Can I use these trees in PostGIS to index and query multidimensional vectors in Postgres?

I also saw that there is an index implementation for int arrays.

Now I have questions about how to perform a query.
Can I perform a KNN search and a radius search on an integer array? I may also need to define my own distance function. Is this possible? I want to use the Manhattan distance (block distance) for my queries.

I can also represent my vector as a binary string with the pattern v1;v2;...;vn. Does this help to perform the search?

For example, if I had these two strings:

1;2;1;1
1;3;2;2

The result / distance between these two strings should be 3.
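With the cube extension discussed in the accepted answer below, this exact computation can be written directly in SQL using the taxicab distance operator `<#>` (available from Postgres 9.6); a minimal sketch:

```sql
-- Manhattan (taxicab) distance between the two example vectors
SELECT cube(ARRAY[1,2,1,1]) <#> cube(ARRAY[1,3,2,2]) AS distance;
-- distance = |1-1| + |2-3| + |1-2| + |1-2| = 3
```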

Mariammarian answered 15/2, 2016 at 9:2 Comment(2)
With your example it isn't exactly clear - do you want to see how many elements differ (Levenshtein distance) or by how much they differ (Manhattan distance), e.g. (3-2) + (2-1) + (2-1)?Hummel
@hruske: I want to know by how much they differ - (Manhattan, taxicab, block) distanceMariammarian
Perhaps a better choice would be the cube extension, since your area of interest is not individual integers but the full vector.

Cube supports GiST indexing, and Postgres 9.6 will also bring KNN indexing to cubes, supporting Euclidean, taxicab (aka Manhattan) and Chebyshev distances.
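Those three metrics map to three KNN ordering operators on cube in 9.6 (`<->` Euclidean, `<#>` taxicab, `<=>` Chebyshev). A sketch, assuming a table like the `vectors` one created below:

```sql
-- swap the operator to change the distance metric:
-- '<->' Euclidean, '<#>' taxicab (Manhattan), '<=>' Chebyshev
SELECT id FROM vectors
ORDER BY vector <#> cube(ARRAY[1, 2, 1, 1])  -- nearest by Manhattan distance
LIMIT 10;
```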

It is a bit annoying that 9.6 is still in development; however, there is no problem backporting the cube-extension patch to 9.5, and I say that from experience.

Hopefully, at 128 dimensions you will still get meaningful results.

How to do this?

First have an example table:

create extension cube;
create table vectors (id serial, vector cube);

Populate table with example data:

insert into vectors
select id, cube(ARRAY[
    round(random()*1000), round(random()*1000),
    round(random()*1000), round(random()*1000),
    round(random()*1000), round(random()*1000),
    round(random()*1000), round(random()*1000)
])
from generate_series(1, 2000000) id;

Then try selecting:

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
   ->  Sort  (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
         Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Seq Scan on vectors  (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
 Planning time: 0.172 ms
 Execution time: 1705.541 ms
(7 rows)

We should create an index:

create index vectors_vector_idx on vectors using gist (vector);

Does it help?

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;

--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
   ->  Index Scan using vectors_vector_idx on vectors  (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
         Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
 Planning time: 0.146 ms
 Execution time: 145.474 ms
(5 rows)

At 8 dimensions, it does help.
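The question also asked about radius search. One way to do that with cube (a sketch, not part of the original answer) is to prefilter with a bounding box built by `cube_enlarge`, which the GiST index can serve via the `@>` containment operator, and then filter by the exact taxicab distance:

```sql
-- all vectors within Manhattan distance 50 of the query point;
-- the enlarged cube is a superset of the taxicab ball, so the
-- second condition removes the false positives from the box test
SELECT * FROM vectors
WHERE cube_enlarge(cube(ARRAY[966,82,765,343,600,718,338,505]), 50, 8) @> vector
  AND cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector <= 50;
```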

Hummel answered 15/2, 2016 at 10:45 Comment(2)
I can also represent the vector as a binary string (v1;v2;v3;...;vn). Is there an option to compare binary strings with Manhattan distance?Mariammarian
Taxicab distance is also known as Manhattan distance. en.wikipedia.org/wiki/Taxicab_geometryHummel

(Addendum to selected answer)

For people wanting more than 100 dimensions, beware: there is a 100-dimension limit in the cube extension.

The tricky part is that Postgres lets you create cubes with more than 100 dimensions just fine. It is when you try to restore a backup that they are refused (the worst time to find out).

As recommended in the documentation, I patched the cube extension to support more dimensions. I made a Docker image for it, and you can look at the Dockerfile to see how to do it yourself from the GitHub repo.

Weepy answered 4/11, 2016 at 17:5 Comment(0)

Cross-posting my answer from this similar DBA Stack Exchange question:


The 2023 answer to this is to use the pgvector extension.

As of this writing, the maximum number of dimensions (defined here) is 16k.
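For completeness, a minimal pgvector sketch (table name and sample data are hypothetical; operator names follow the pgvector README):

```sql
CREATE EXTENSION vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- nearest neighbours by Euclidean distance; pgvector uses '<->' for L2,
-- '<#>' for negative inner product and '<=>' for cosine distance
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

-- optional approximate index for large tables
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
```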

Loeb answered 5/4, 2023 at 14:55 Comment(5)
It's unfortunate that pgvector is not supported in AWS RDS hosted Postgres. Realistically my current project is not going to move off RDS. cube is on RDS though.Concessive
@Concessive for sure -- pgvector recommends (here) to contact AWS via this page about adding extensions. I feel like there's a good chance of pgvector making it in considering the excitement around embeddings right now.Loeb
What's the difference between pgvector and cube?Randell
cube is limited in the number of dimensions it supports. From the postgres docs: "To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes". Depending on the use case, it may be enough or not (AI embeddings typically require a few thousand dimensions)Loeb
pgVector is indeed supported now by RDS, AWS posted the announcement recently. See aws.amazon.com/about-aws/whats-new/2023/05/…Lachlan

© 2022 - 2024 — McMap. All rights reserved.