Postgresql k-nearest neighbor (KNN) on multidimensional cube
Asked Answered
S

2

6

I have a cube that has 8 dimensions. I want to do nearest neighbor matching. I'm totally new to postgresql. I read that 9.1 supports nearest neighbor matching on multidimensions. I'd really appreciate if someone could give a complete example:

  1. How to create a table with the 8D cube ?

  2. Sample Insert

  3. Lookup - exact matching

  4. Lookup - nearest neighbor matching

Sample Data:

For simplicity sake, we can assume that all the values range from 0-100.

Point1: (1,1,1,1, 1,1,1,1)

Point2: (2,2,2,2, 2,2,2,2)

Look up value: (1,1,1,1, 1,1,1,2)

This should match against Point1 and not Point2.

Refs:

What's_new_in_PostgreSQL_9.1

https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search

Stevestevedore answered 21/5, 2013 at 18:8 Comment(3)
Can you explain what data you have, maybe provide some small sample? I think 8D cube is just a table with 8 columns (dimensions).Spillage
I edited the question to include the sample data. Yes, 8D cube can be represented using 8 different numeric columns.Stevestevedore
I've added complete example to my original answer.Spillage
S
6

PostgreSQL supports distance operator <-> and as I understand it, this can be used for analyzing text (with pg_trgrm module) and geometry data type.

I do not know how you can use it with more than 1 dimension. Maybe you will have to define your own distance function or somehow convert your data to one column with text or geometry type. For example if you have table with 8 columns (8-dimensional cube):

c1 c2 c3 c4 c5 c6 c7 c8
 1  0  1  0  1  0  1  2

You can convert it to:

c1 c2 c3 c4 c5 c6 c7 c8
 a  b  a  b  a  b  a  c

And then to table with one column:

c1
abababac

Then you can use (after creating gist index):

SELECT c1, c1 <-> 'ababab'
 FROM test_trgm 
 ORDER BY c1 <-> 'ababab';

Example

Create sample data

-- Create some temporary data
-- ! Note that table are created in tmp schema (change sql to your scheme) and deleted if exists !
drop table if exists tmp.test_data;

-- Random integer matrix 100*8 
create table tmp.test_data as (
   select 
      trunc(random()*100)::int as input_variable_1,
      trunc(random()*100)::int as input_variable_2, 
      trunc(random()*100)::int as input_variable_3,
      trunc(random()*100)::int as input_variable_4, 
      trunc(random()*100)::int as input_variable_5, 
      trunc(random()*100)::int as input_variable_6, 
      trunc(random()*100)::int as input_variable_7, 
      trunc(random()*100)::int as input_variable_8
   from 
      generate_series(1,100,1)
);

Transform input data to text

drop table if exists tmp.test_data_trans;

create table tmp.test_data_trans as (
select 
   input_variable_1 || ';' ||
   input_variable_2 || ';' ||
   input_variable_3 || ';' ||
   input_variable_4 || ';' ||
   input_variable_5 || ';' ||
   input_variable_6 || ';' ||
   input_variable_7 || ';' ||
   input_variable_8 as trans_variable
from 
   tmp.test_data
);

This will give you one variable trans_variable where all the 8 dimensions are stored:

trans_variable
40;88;68;29;19;54;40;90
80;49;56;57;42;36;50;68
29;13;63;33;0;18;52;77
44;68;18;81;28;24;20;89
80;62;20;49;4;87;54;18
35;37;32;25;8;13;42;54
8;58;3;42;37;1;41;49
70;1;28;18;47;78;8;17

Instead of || operator you can also use the following syntax (shorter, but more cryptic):

select 
   array_to_string(string_to_array(t.*::text,''),'') as trans_variable
from 
   tmp.test_data t

Add index

create index test_data_gist_index on tmp.test_data_trans using gist(trans_variable);

Test distance Note: I've selected one row from table - 52;42;18;50;68;29;8;55 - and used slightly changed value (42;42;18;52;98;29;8;55) to test the distance. Of course, you will have completely different values in your test data, because it is RANDOM matrix.

select 
   *, 
   trans_variable <->  '42;42;18;52;98;29;8;55' as distance,
   similarity(trans_variable, '42;42;18;52;98;29;8;55') as similarity,
from 
   tmp.test_data_trans 
order by
   trans_variable <-> '52;42;18;50;68;29;8;55';

You can use distance operator <-> or similiarity function. Distance = 1 - Similarity

Spillage answered 22/5, 2013 at 7:10 Comment(3)
Thanks twn08. I ran into this error when I attempt to create the index: create index test_data_gist_index on tmp.test_data_trans using gist(trans_variable); ERROR: data type text has no default operator class for access method "gist" SQL state: 42704 Hint: You must specify an operator class for the index or define a default operator class for the data type.Stevestevedore
Maybe btree_gist is missing? Similar problem in this questionSpillage
I didn't see you define a distance metric between entries in the semicolon-joined column. Is the <-> operator using string distance or geometric distance of the decoded point?Anticlinorium
C
5

A "patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances" was recently offered on the pgsql-hackers list. It might work for your purpose, if you can customize your PostgreSQL build.

Note that the cube type, a PostgreSQL extension, can be used to represent points or cubes in n-dimensions. (The value of n can go up to 100 by default, more if a limit in cubedata.h is raised.) So, this patch should among other things enable index-assisted multidimensional point/vector/cube nearest-neighbor search.

(Without this patch, the cube type doesn't have a <-> distance operator, and a support function (#8) is missing from the OPERATOR CLASS gist_cube_ops which is needed to give gist the ability to make a distance-related index on these values.)

I haven't yet tried the patch, and note that one of the discussion-list replies suggests it may currently break some regression tests.

Casto answered 2/10, 2013 at 3:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.