How to get cosine distance between two vectors in postgres?
Asked Answered
M

3

11

I am wondering if there is a way to get cosine distance of two vectors in postgres. For storing vectors I am using CUBE data type.

Below is my table definition:

test=# \d vectors                                                                                                                                
                            Table "public.vectors"
 Column |  Type   | Collation | Nullable |               Default               
--------+---------+-----------+----------+-------------------------------------
 id     | integer |           | not null | nextval('vectors_id_seq'::regclass)
 vector | cube    |           |          | 

Also, sample data is given below:

test=# select * from vectors order by id desc limit 2;
   id    |                  vector                  
---------+------------------------------------------
 2000000 | (109, 568, 787, 938, 948, 126, 271, 499)
 1999999 | (139, 365, 222, 653, 313, 103, 215, 796)

I actually can write my own PLPGSql function for this, but wanted to avoid this as it might not be efficient.

Multifarious answered 29/6, 2019 at 10:33 Comment(0)
O
15

About your table

First of all, I believe you should change your data type to plain array.

CREATE TABLE public.vector ( 
  id serial NOT NULL,
  vctor double precision [3] --for three dimensional vectors; of course you can change the dimension or leave it unbounded if you need it.
 );

INSERT INTO public.vector (vctor) VALUES (ARRAY[2,3,4]);
INSERT INTO public.vector (vctor) VALUES (ARRAY[3,4,5]);

So

SELECT * FROM public.vector;

Will result in the following data

   id |   vctor
------|---------
    1 | {2,3,4}
    2 | {3,4,5}

Maybe not the answer you expected but consider this

As you may know already, calculating the cosine between the vectors involves calculating the magnitudes. I don't think the problem is the algorithm but the implementation; it requires calculating squares and square roots that is expensive for a RDBMS.

Now, talking about efficiency; the server process does not take the load when calling mathematical functions. In PostgreSQL, the mathematical functions (look here) run from the C library so they are pretty efficient. However, in the end, the host has to assign some resources to make these calculations.

I would indeed think carefully before implementing these rather costly operations inside the server. But there is not a right answer; it depends on how you are using the database. For example if it is a production database with thousands of concurrent users, I would move this kind of calculation elsewhere (a middle layer or a user application.) But if there are few users and your database is for a small research operation, then it is fine to implement it as a stored procedure or a process running inside your server but keep in mind this will affect scalability or portability. Of course, there are more considerations like how many rows will be processed, or whether or not you intend to fire triggers, etc.

Consider other alternatives

Make a client app

You can do a fast and decent program in VB or the language of your choice. And let the client app make the heavy calculation and use the database for what it does best that is storing and retrieving data.

Store the data differently

For this particular example, you could store the unit vectors plus the magnitude. In this way, finding the cosine between any two vectors reduces simply to the dot product of the unit vectors (only multiplication and division and no squares nor square roots.)

CREATE TABLE public.vector ( 
     id serial NOT NULL,
     uvctor double precision [3], --for three dimensional vectors; of course you can change the dimension or make it decimal if you need it
     magnitude double precision
 ); 

INSERT INTO public.vector (vctor) VALUES (ARRAY[0.3714, 0.5571, 0.7428], 5.385); -- {Ux, Uy, Uz}, ||V|| where V = [2, 3, 4];
INSERT INTO public.vector (vctor) VALUES (ARRAY[0.4243, 0.5657, 0.7071], 7.071); -- {Ux, Uy, Uz}, ||V|| where V = [3, 4, 5];

SELECT a.vctor as a, b.vctor as b, 1-(a.uvctor[1] * b.uvctor[1] + a.uvctor[2] * b.uvctor[2] + a.uvctor[3] * b.uvctor[3]) as cosine_distance FROM public.vector a
JOIN public.vector b ON a.id != b.id;

Resulting in

                          a  |                           b  | cosine_distance
-----------------------------|------------------------------|------------------
{0.3714,0.5571,0.7428,5.385} | {0.4243,0.5657,0.7071,7.071} |      0.00202963
{0.4243,0.5657,0.7071,7.071} | {0.3714,0.5571,0.7428,5.385} |      0.00202963

Even if you have to calculate the magnitude of the vector inside the server, you will make it once per vector and not every time you need to get the distance between two of them. This becomes more important as the number of rows is increasing. For 1000 vectors for example, you would have to calculate the magnitude 999000 times if you wanted to obtain the cosine difference between any two vectors using the original vector components.

Any combination of the above

Conclusion

When we pursue efficiency, most of the times there is not a canonical answer. Instead we have trade-offs that we have to consider and evaluate. It always depends on the ultimate goal we need to achieve. Databases are excellent for storing and retrieving data; they can definitely make other things but that comes with an added cost. If we can live with the added overhead then it's fine; otherwise we have to consider alternatives.

Oscillatory answered 6/7, 2019 at 11:39 Comment(6)
Hey, thanks for the answer. Let me tell you my ultimate goal. I am storing 2 million records in the DB, I need to find the similar vectors for a given vector. So I was contemplating to calculate different distances between vectors, such as Euclidean, cosine etc, Using cube data type , performance of finding the Euclidean distance was acceptable. I just wanted similar performance for cosine distance. As you can see the data size is big, so calculating the cosine distance in a client app might not be a good idea.Multifarious
Yes for 2M records it is quite expensive to make an on-the-fly calculation for any given vector against all the set; for this I would favor the unit vector representation. With the cosine similarity between two vectors (1-cosine_distance) you can calculate the euclidean distance using the law of cosines; you could even store the square of the magnitude of every vector in the table. I wonder, how the data is generated? How will you use the results (i.e. will you calculate the metrics trough queries or reports; is this part of a bigger system that uses an app anyway?)Oscillatory
that's really insightful. I will sure give it a try. 1. How the data is generated >> there are some third party libraries available which generates vectors ( eg: OpenCV ), 2. How will you use the results >> I am working on a POC, I had a initial plan to use sql queries to get the reports. Also, I am going to mark this answer as accepted, as it may not be the best way to achieve the desired result, but definitely it looks like one of the appropriate solution to me.Multifarious
@Oscillatory Why do you subtract from 1?Docia
@Docia The scalar product of two unit vectors is the similarity (or the cosine of the angle.) Distance = 1 - Similarity. Two unit vectors with the same orientation have a similarity = 1 and a distance = 0; to orthogonal unit vectors have a similarity = 0 and a distance = 1: and two opposed unit vectors have a similarity = -1 and a distance = 2.Oscillatory
Right, yeah, that makes sense. I misread the original question as cosine similarity (which is for my own use case) instead of cosine distance. Thanks.Docia
S
8

you can take reference to my code.

--for calculation of norm vector --
CREATE or REPLACE FUNCTION public.vector_norm(IN vector double precision[])
    RETURNS double precision AS 
$BODY$

BEGIN

    RETURN(SELECT SQRT(SUM(pow)) FROM (SELECT POWER(e,2) as pow from unnest(vector) as e) as norm);
END;
$BODY$ LANGUAGE 'plpgsql'; 
ALTER FUNCTION public.vector_norm(double precision[]) OWNER TO postgres;

COMMENT ON FUNCTION public.vector_norm(double precision[]) IS 'This function is used to find a norm of vectors.';

--call function--
select public.vector_norm('{ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794}')




--for caculation of dot_product--
CREATE OR REPLACE FUNCTION public.dot_product(IN vector1 double precision[], IN vector2 double precision[])
    RETURNS double precision    
AS $BODY$
BEGIN
    RETURN(SELECT sum(mul) FROM (SELECT v1e*v2e as mul FROM unnest(vector1, vector2) AS t(v1e,v2e)) AS denominator);
END;
$BODY$ LANGUAGE 'plpgsql';

ALTER FUNCTION public.dot_product(double precision[], double precision[]) OWNER TO postgres;

COMMENT ON FUNCTION public.dot_product(double precision[], double precision[])
    IS 'This function is used to find a cosine similarity between two multi-dimensional vectors.';


--call fuction--
SELECT public.dot_product(ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794],ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794])



--for calculatuion of cosine similarity--
CREATE OR REPLACE FUNCTION public.cosine_similarity(IN vector1 double precision[], IN vector2 double precision[])
    RETURNS double precision
    LANGUAGE 'plpgsql'

AS $BODY$
BEGIN
    RETURN(select ((select public.dot_product(ARRAY[ 0.63434,0.23487,0.324323], ARRAY[ 0.63434,0.23487,0.324323]) as dot_pod)/((select public.vector_norm(ARRAY[ 0.63434,0.23487,0.324323]) as norm1) * (select public.vector_norm(ARRAY[ 0.63434,0.23487,0.324323]) as norm2))) AS similarity_value) 
END;
$BODY$;

ALTER FUNCTION public.cosine_similarity(double precision[], double precision[])
    OWNER TO postgres;

COMMENT ON FUNCTION public.cosine_similarity(double precision[], double precision[])
    IS 'this function is used to find a cosine similarity between two vector';
Soup answered 25/9, 2019 at 9:50 Comment(3)
+1 for the answer. It would add more value if you can briefly elaborate your approach and provide some performance stats of this function.Multifarious
your function cosine_similarity uses some arbitrary vectors instead of its parameters. did you forgot something?Linter
also you have a syntax error. here i fixed it for you: CREATE OR REPLACE FUNCTION public.cosine_similarity(IN vector1 double precision[], IN vector2 double precision[]) RETURNS double precision LANGUAGE 'plpgsql' AS $BODY$ BEGIN RETURN ( select ((select public.dot_product(vector1, vector2) as dot_pod)/((select public.vector_norm(vector1) as norm1) * (select public.vector_norm(vector2) as norm2))) AS similarity_value ); END; $BODY$; Linter
V
-1

In PostgreSQL, you can calculate the cosine similarity between two vectors using the built-in pg_trgm extension, which provides functions for similarity operations, including cosine similarity. The pg_trgm extension is commonly used for similarity calculations involving text and vectors.

Here's a step-by-step guide on how to calculate cosine similarity between two vectors using PostgreSQL and the pg_trgm extension:

Enable the pg_trgm extension (if not already enabled):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

Prepare your data and create a table to store the vectors. For this example, let's assume you have a table named vectors_table with columns vector_id and vector_values, where vector_values contains the vector values stored as an array.

CREATE TABLE vectors_table (
    vector_id SERIAL PRIMARY KEY,
    vector_values double precision[]
);

Insert your vector data into the table. Replace the vector_values array with your actual vector values.

INSERT INTO vectors_table (vector_values)
VALUES
    ('{0.5, 0.3, 0.8, 0.1}'),
    ('{0.2, 0.7, 0.4, 0.6}'),
    -- Add more vectors as needed
    ('{0.9, 0.2, 0.1, 0.5}');

Calculate the cosine similarity between two vectors using a custom function. Here's the function to calculate the cosine similarity:

CREATE OR REPLACE FUNCTION cosine_similarity(a double precision[], b double precision[])
RETURNS double precision AS $$
DECLARE
    dot_product double precision;
    norm_a double precision;
    norm_b double precision;
BEGIN
    dot_product := 0;
    norm_a := 0;
    norm_b := 0;

    FOR i IN 1..array_length(a, 1) LOOP
        dot_product := dot_product + a[i] * b[i];
        norm_a := norm_a + a[i] * a[i];
        norm_b := norm_b + b[i] * b[i];
    END LOOP;

    norm_a := sqrt(norm_a);
    norm_b := sqrt(norm_b);

    IF norm_a = 0 OR norm_b = 0 THEN
        RETURN 0;
    ELSE
        RETURN dot_product / (norm_a * norm_b);
    END IF;
END;
$$ LANGUAGE plpgsql;

Now you can use the cosine_similarity function to compute the similarity between any two vectors in the vectors_table:

SELECT
    vector_id AS vector1_id,
    v1.vector_values AS vector1,
    vector_id AS vector2_id,
    v2.vector_values AS vector2,
    cosine_similarity(v1.vector_values, v2.vector_values) AS similarity
FROM
    vectors_table v1
CROSS JOIN
    vectors_table v2
WHERE
    v1.vector_id <> v2.vector_id;

This query will give you a result showing the pairwise cosine similarity between all vectors in the table, excluding self-similarity.

Remember that cosine similarity measures the cosine of the angle between two vectors and ranges from -1 to 1. A value of 1 indicates the vectors are identical, 0 means the vectors are orthogonal, and -1 means they are diametrically opposed.

Note: Modify your code accordingly.

Verdure answered 24/7, 2023 at 6:49 Comment(2)
This answer -to me- looks like ChatGPT generated.Chemiluminescence
@doneforaiur then take a shortcut from me for the whole conversation. Enable PG Vector extension on your schema, then store the embedding in a column with the datatype as vector, there after you can use the queries like SELECT * FROM items WHERE 1 - (embedding <=> '[3,1,2]') desc limit 10;Verdure

© 2022 - 2024 — McMap. All rights reserved.