How to store n-dimensional vector in Microsoft SQL Server?
Asked Answered
T

4

7

I want to store a large n-dimensional vector (e.g. an embedding vector) in SQL Server as a piece of metadata associated with another row.

In this example, it will be a 384-dimensional vector, for example:

[0.161391481757164,   -0.23294533789157867, -0.5648667216300964,  -0.3210797905921936,  -0.03274689242243767,  0.011770576238632202, -0.06612513959407806,
-0.14662186801433563, -0.17081189155578613,  0.2879514992237091,  -0.1932784765958786,   0.009713868610560894, 0.23330552875995636,   0.03551964834332466,
-0.20526213943958282,  0.06445703655481339, -0.3146169185638428,   0.5788811445236206,   0.09118294715881348, -0.0048667509108781815,-0.16503077745437622,
 0.25162017345428467, -0.36395764350891113, -0.34742429852485657,  0.0526515394449234,   0.08912508934736252,  0.48464590311050415,  -0.04224267974495888,
 0.32445403933525085, -0.6847451329231262,  -0.20959551632404327, -0.027657458558678627, 0.20439794659614563,  0.6859520077705383,   -0.4988805055618286,
-0.26204171776771545, -0.18842612206935883,  0.07067661732435226,  0.02633148804306984,  0.03182782977819443,  0.28935596346855164,  -0.0016041728667914867,
 0.14609676599502563, -0.36272501945495605,  0.10288259387016296, -0.3651926815509796,  -0.3823530375957489,   0.14052163064479828,   0.006418740376830101,
 0.11741586774587631, -0.6509529948234558,  -0.15997739136219025, -0.42837604880332947,  0.12351743131875992,  0.0485026054084301,    0.24820692837238312,
 0.46972623467445374, -0.47954055666923523, -0.5238635540008545,  -0.3543052673339844,   0.22626525163650513,  0.18406584858894348,   0.6463921070098877,
 0.11894208937883377, -0.07143554836511612,  0.004256516695022583, 0.10088140517473221,  0.3335645794868469,   0.16905969381332397,   0.056856121867895126,
 0.11355260014533997,  0.3708053231239319,  -0.7484591603279114,   0.17503942549228668, -0.3249044418334961,   0.5901510715484619,    0.41506800055503845,
 0.05852462351322174,  0.5119204521179199,   0.2750142216682434,  -0.2058306783437729,   0.8199670314788818,   0.16698679327964783,  -0.1572146713733673,
 0.014733579009771347 ,0.0168467964977026,   0.4688740372657776,  -0.07839230448007584,  0.49326324462890625, -0.29934313893318176,   0.21525822579860687,
 0.1396997570991516,  -0.3420834243297577,  -0.5197309851646423,   0.10842061042785645, -0.0338996984064579,   0.35846689343452454,  -0.1660442352294922,
 0.15579357743263245,  0.015674782916903496,-0.8510578870773315,  -0.07501569390296936, -0.1791406124830246,   0.14926102757453918,  -0.2269722819328308,
 0.42619261145591736,  0.09489753842353821, -0.13341256976127625,  0.3312526345252991,   0.22534190118312836,  0.0679713636636734,    0.17042726278305054,
 0.14300595223903656, -0.06654901057481766, -0.2170567661523819,  -0.454984188079834,   -0.5516679286956787,  -0.10752955824136734,  -0.05743071809411049,
 0.32108309864997864, -0.5445901155471802,  -0.43162357807159424,  0.08207866549491882,  0.0664522647857666,   0.4478979706764221,    0.2190810590982437,
-0.05722910910844803, -0.0932786613702774,   0.01758035272359848,  0.16166797280311584,  0.44004616141319275, -0.21601708233356476,   0.43121641874313354,
 0.32022470235824585, -0.014045504853129387,-0.24948528409004211, -0.4389941990375519,   0.3816317319869995,  -0.5687862038612366,    0.1088542640209198,
-0.403241366147995,    0.08174201846122742,  0.21350793540477753,  0.2396722435951233,   0.4973253607749939,   0.31202447414398193,  -0.5260801315307617,
-0.3351263403892517,  -0.04100760444998741,  0.6609364151954651,  -0.2047063261270523,   0.19385716319084167, -0.5661329627037048,   -0.27058693766593933,
-0.1637117713689804,   0.30641692876815796, -0.08894442766904831, -0.052735116332769394,-0.13839660584926605, -0.6741533875465393,    0.05569711700081825,
-0.04354270175099373,  0.20251914858818054,  0.24813368916511536,  0.1719648838043213,   0.26782000064849854,  0.3137670159339905,    0.18599936366081238,
 0.23953016102313995,  0.17769533395767212,  0.46293920278549194, -0.19122551381587982, -0.5595004558563232,   0.09755659103393555,   0.3125424385070801,
-0.5813230276107788,  -1.0698442459106445,  -0.09045401215553284, -0.08948248624801636, -0.051830895245075226,-0.0001317809073952958,-0.08400193601846695,
 0.25725823640823364, -0.10135184973478317,  0.07884480804204941,  0.2091679722070694,   0.3950233459472656,   0.2745698094367981,   -0.872776448726654,
-0.16590780019760132,  0.4308463931083679,  -0.24375642836093903, -0.02120584435760975,  0.05213866010308266, -0.19898287951946259,  -0.5506985187530518,
 0.40167248249053955,  0.1640072464942932,  -0.010167916305363178, 0.14038121700286865,  0.4958030879497528,  -0.7259818315505981,   -0.24387206137180328,
 0.08528701961040497,  0.03415993973612785, -0.16687284409999847,  0.3804749548435211,  -0.08561687171459198, -0.2752263844013214,    0.5883951783180237,
-0.3283255994319916,  -0.12724250555038452,  0.08751262724399567, -0.44206979870796204, -0.11079336702823639, -0.16302113234996796,   0.11022322624921799,
-0.09404750168323517, -0.256179541349411,    0.20473307371139526,  0.41829538345336914, -0.1095203086733818,   0.02342342585325241,  -0.18814104795455933,
-0.2540932893753052,   0.48397907614707947,  0.03593514859676361, -0.089835524559021,   -0.6478171944618225,  -0.1757517009973526,    0.0672023594379425,
 0.0695127546787262,  -0.6398074626922607,  -0.03958022966980934, -0.10351496934890747,  0.22433893382549286,  0.6756673455238342,   -0.2924160957336426,
 0.17503827810287476,  0.12915058434009552, -0.239552840590477,    0.15498916804790497, -0.4730042815208435,  -0.12289212644100189,  -0.004052990116178989,
 0.11593572050333023, -0.1965983510017395,   0.5210273265838623,  -0.18184830248355865,  0.2579534947872162,  -0.1920309066772461,   -0.389960378408432,
 0.04139290377497673, -0.11638019979000092, -0.10620912909507751, -0.5321099162101746,   0.13135096430778503, -0.07761876285076141,  -0.0830138698220253,
-0.01572849042713642,  0.31080499291419983, -0.41445496678352356,  0.1609737128019333,   0.5787453651428223,  -0.05459209159016609,   0.1318219006061554,
-0.06957206130027771,  0.15152350068092346, -0.07094550132751465, -0.196294367313385,    0.12644843757152557,  0.23419199883937836,   0.5845456719398499,
-0.19989481568336487, -0.19607964158058167, -0.19692276418209076, -0.08633144199848175, -0.004551170393824577, 0.09362921118736267,  -0.14167727530002594,
-0.14917594194412231,  0.31781134009361267,  0.18779256939888,     0.42154577374458313, -0.20578211545944214,  0.14142100512981415,  -0.5664211511611938,
 0.18177354335784912,  0.14776530861854553,  0.29254236817359924,  0.17831481993198395, -0.1894354224205017,  -0.2836195230484009,   -0.4065170884132385,
-0.14325398206710815,  0.17800962924957275,  0.7763587832450867,   0.5497004389762878,  -0.00946379080414772, -0.48568078875541687,  -0.022227048873901367,
-0.005903944373130798, 0.4351034462451935,   0.05010621249675751, -0.12799566984176636, -0.06675072759389877,  0.167253315448761,    -0.1653994619846344,
 0.21004730463027954,  0.2765181362628937,   0.5885812640190125,  -0.326379656791687,   -0.007390940561890602, 0.27159956097602844,  -0.043763305991888046,
-0.39229199290275574, -0.19412016868591309,  0.4250912666320801,   0.6105153560638428,  -0.06168382614850998, -0.5341082811355591,   -0.611929714679718,
 0.08125612139701843, -0.1779184639453888,   0.5319408774375916,  -0.23601730167865753,  0.22285249829292297, -0.32505497336387634,   0.2152460366487503,
 0.4679816663265228,   0.048206135630607605,-0.24099768698215485, -0.30208054184913635,  0.13667792081832886,  0.3552468717098236,   -0.12280546128749847,
-0.006191314198076725,-0.10851636528968811,  0.08330328017473221, -0.09545236080884933, -0.02249046228826046,  0.0003346469602547586,-0.12273653596639633,
-0.05594412609934807,  0.027804357931017876,-0.4045255482196808,  -0.18987023830413818, -0.0027474926318973303,0.30244430899620056,   0.2323288917541504,
-0.2729185223579407,   0.12836921215057373,  0.27967774868011475,  0.3031359016895294,   0.41273725032806396, -0.06173351779580116,   0.33845168352127075,
 0.26775869727134705, -0.2933143079280853,  -0.0485006645321846,   0.11777450144290924,  0.6205862760543823,  -0.07637807726860046,  -0.19466432929039001,
-0.3994691073894501,   0.15689416229724884, -0.11139731854200363, -0.2333720475435257,   0.2364773154258728,   0.30898618698120117,  -0.1263875812292099,
-0.231489360332489,    0.34536853432655334,  0.6001318097114563,  -0.44741731882095337,  0.07382357120513916, -0.019649405032396317, -0.1029537245631218,
 0.369470477104187,   -0.032077688723802567,-0.13972929120063782,  0.24549521505832672, -0.13091856241226196, -0.029257331043481827]

Attempt#1 - n-dimensional vector → n-columns

My first thought was to store the 384 real values in a separate table, with a key to the original row (vertical partitioning):

CREATE TABLE Embeddings (
   RowGUID uniquedientifier NOT NULL PRIMARY KEY,
   f1 real NOT NULL, 
   f2 real NOT NULL,
   f3 real NOT NULL,
   f4 real NOT NULL,
   f5 real NOT NULL,
   f6 real NOT NULL,
   f7 real NOT NULL,
   f8 real NOT NULL,
   f9 real NOT NULL,
   f10 real NOT NULL,
   ...snip...
   f384 real NOT NULL)
RowGUID f1 f2 f3 f4 f5 f6 f7 ... f384
6ba7b814-9dad-11d1-80b4-00c04fd430c8 0.161391481757164 -0.23294533789157867 -0.5648667216300964 -0.3210797905921936 -0.03274689242243767 0.011770576238632202 -0.06612513959407806 ... -0.029257331043481827

This...sorta...works. But it is unwieldy. Plus, my vectors today happen to be 385-dimensional; but they may soon be 1556-dimensional, which exceeds the SQL Server maximum of 1,024 columns per table.

Attempt#2 - n-dimensional → n-IEEE 32-bit floats → varbinary(n*4)

The next idea was to pack the 4-byte (32-bit) floats into a varbinary column:

CREATE TABLE Embeddings (
   RowGUID uniquedientifier NOT NULL PRIMARY KEY,
   PackedVector varbinary(1516) NOT NULL -- 384 floats * 4 bytes = 1540 bytes
)
0x0000000100000002000000030000000400000005000000060000000700000008...0000017F
  \______/\______/\______/\______/\______/\______/\______/\______/   \______/
     f1      f2      f3      f4      f5      f6      f7      f8        f384

And then when it comes time to read each Single, use SUBSTRING to rip the 4-byte float out of the varbinary, and then convert it to a real:

DECLARE @f1 real = CAST(SUBSTRING(PackedVector, 0*4, 4) AS real);

Except two down-sides:

  • Downside#1: You cannot convert a binary(4) to a real (even though you can convert a real to a binary(4); just not the other way:

    enter image description here

    May be able to workarond it with decimal or numeric).

  • Downside#2: The math of computing the euclidian distance between two vectors is conceptually valid:

    DECLARE @target VARBINARY(1536) -- packed 384-dimensional vector
    
    SELECT TOP(10) RowGUID, SUM(POWER(CAST(SUBSTRING(Embedding, i*4+1, 4) AS real) - CAST(SUBSTRING(@target, i*4 + 1, 4) AS real), 2)) as distance
    FROM Embeddings
    CROSS APPLY (VALUES (0), (1), (2), ..., (383)) AS sequence(i) -- Fill in the values from 0 to 383
    GROUP BY RowGUID
    ORDER BY distance ASC
    

    But that will be pretty poorly performing (even if issue #1 didn't exist).

Attempt 3 - Do what Joe Celko does

Many years ago, someone on the Microsoft newsgroups had the same question:

Can anyone point me to a reference or discuss the best way to store a vector of 120 to 480 numbers in the database? Rows seem to be out since we would quickly top the billion row mark. A table with 480 columns is too unnormalized. A single varchar(max) column? This seems the best answer for now unless there is a more efficiant way of storing it.

Thanks for any help or opinions,

And then --CELKO-- responded:

I think of a vector as a particular kind of mathematical structure and you seem to be talking about a list of some kind. Vectors have a fixed number of dimensions, etc. Here is a guess:

CREATE TABLE Vectors (
   vector_id CHAR(3) NOT NULL, --whatever
   dim_nbr   INTEGER NOT NULL,
   CHECK (dim_nbr BETWEEN 1 AND 480),
   PRIMARY KEY (vector_id, dim_nbr),
   dim_val   INTEGER NOT NULL
);

Making the values of a the vector into rows:

Embeddings

RowGUID dimNumber dimValue
6ba7b814-9dad-11d1-80b4-00c04fd430c8 1 0.161391481757164
6ba7b814-9dad-11d1-80b4-00c04fd430c8 2 -0.23294533789157867
6ba7b814-9dad-11d1-80b4-00c04fd430c8 3 -0.5648667216300964
6ba7b814-9dad-11d1-80b4-00c04fd430c8 4 -0.3210797905921936
6ba7b814-9dad-11d1-80b4-00c04fd430c8 5 -0.03274689242243767
6ba7b814-9dad-11d1-80b4-00c04fd430c8 6 0.011770576238632202
6ba7b814-9dad-11d1-80b4-00c04fd430c8 7 -0.06612513959407806
... ... ...
6ba7b814-9dad-11d1-80b4-00c04fd430c8 384 -0.029257331043481827

This is probably the best approach.

Nothing better?

Doesn't SQL Server has better support for vectors? I know there is GEOSPATIAL/GEOGRAPHY types, but i gather those only work for 2-dimensional vectors (e.g. lattuitude+logitude)? Can't they be abused to solve the problem?

And since the goal is to compute euclidian distance between two vectors, is there a data structure that does a better job of allowing math? (varchar? xml? json? varbinary? variant?)

Bonus Reading

Thill answered 3/5, 2023 at 18:44 Comment(1)
geography goes up to 4 dimensions I think, but that won't help you. Sounds like you need an SQLCLR type, but you can't use them in Azure thoughBimolecular
S
5

You may also try something like explained here

There is no specific data type available to store a vector in Azure SQL database, but we can use some human ingenuity to realize that a vector is just a list of numbers. As a result, we can store a vector in a table very easily by creating a column to contain vector data. One row per vector element. We can then use a columnstore index to efficiently store and search for vectors.

So create a table to hold the vectors:

CREATE TABLE [dbo].[embeddings]
(
    [article_id] [int] NOT NULL,
    [vector_value_id] [int] NOT NULL,
    [vector_value] [float] NOT NULL
) 

And then we can use T-SQL to efficiently compute distances:

On that table we can create a column store index to efficiently store and search for vectors. Then it is just a matter of calculating the distance between vectors to find the closest. Thanks to the internal optimization of the columnstore (that uses SIMD AVX-512 instructions to speed up vector operations) the distance calculation is extremely fast.

The most common distance is the cosine similarity, which can be calculated quite easily in SQL.

SELECT 
    SUM(a.value * b.value) / (  
        SQRT(SUM(a.value * a.value)) * SQRT(SUM(b.value * b.value))   
    ) AS cosine_similarity
FROM
    vectors_values

The SQL query is calculating the cosine similarity between two vectors, represented by the columns a.value and b.value. Cosine similarity is defined as follows:

LaTeX

Where:

  • LaTeX is the dot product of vectors a and asdf
  • (|A|) and (|B|) are the magnitudes of vectors (A) and (B), respectively

The important parts of the query are:

  1. SUM(a.value * b.value): Calculates the dot product of vectors (A) and (B).
  2. SQRT(SUM(a.value * a.value)): Calculates the magnitude of vector (A).
  3. SQRT(SUM(b.value * b.value)): Calculates the magnitude of vector (B).

Finally, the entire expression divides the dot product by the product of the magnitudes to find the cosine similarity, which is returned as cosine_similarity.

Sororicide answered 31/8, 2023 at 19:7 Comment(0)
R
2

I've solved exactly the same task using plain json text as a vector store.

I used openjson() to convert it to a table on the fly.

I created a function to build a binary search index from a set of vectors, and a function to search vectors in a given vicinity of a vector using binary index.

These simple steps turn SQL Server in vector database.

See vector-database.

Reconnoiter answered 22/6, 2023 at 20:34 Comment(0)
V
1

The real advantage of vector databases (I work for Weaviate fwiw) is that it allows ANN-based vector indices to be built. Vector searches are computationally very expensive, and vector DBs allow searches that are orders magnitudes faster than say, brute-force searches through a Numpy array or something.

I would definitely not advise saving vectors into SQL if you want to do any kind of similarity-based retrieval, which is probably why you have vectors in the first place.

Vernverna answered 3/5, 2023 at 23:36 Comment(0)
I
-1

Maybe you could try MyScale DB, which is a SQL-based vector database.

Intertwist answered 4/5, 2023 at 4:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.