How to implement full text search on complex nested JSONB in Postgresql
Asked Answered
N

2

12

I have pretty complex JSONB stored in one jsonb column.

DB table looks like:

 CREATE TABLE sites (
   id text NOT NULL,
   doc jsonb,
   PRIMARY KEY (id)
 )

Data we are storing in doc column is a complex nested JSONB data:

   {
      "_id": "123",
      "type": "Site",
      "identification": "Custom ID",
      "title": "SITE 1",
      "address": "UK, London, Mr Tom's street, 2",
      "buildings": [
          {
               "uuid": "12312",
               "identification": "Custom ID",
               "name": "BUILDING 1",
               "deposits": [
                   {
                      "uuid": "12312",
                      "identification": "Custom ID",             
                      "audits": [
                          {
                             "uuid": "12312",         
                              "sample_id": "SAMPLE ID"                
                          }
                       ]
                   }
               ]
          } 
       ]
    }

So structure of my JSONB looks like:

SITE 
  -> ARRAY OF BUILDINGS
     -> ARRAY OF DEPOSITS
       -> ARRAY OF AUDITS

We need to implement full text search by some values in each of type of entry:

SITE (identification, title, address)
BUILDING (identification, name)
DEPOSIT (identification)
AUDIT (sample_id)

SQL query should run a full text search in these field values only.

I guess need to use GIN indexes and something like tsvector, but do not have enough Postgresql background.

So, my question is it possible to index and then query such nested JSONB structures?

Nonconcurrence answered 14/8, 2017 at 18:39 Comment(5)
The first shot would be "denormalize" the storage: sacrifice some storage place for brevity. Extract needed data in separate fields: site, building, deposit, audit, containing pure string contantenation of needed fields, i.e. building.identification ||';'||building.title||';'||building.address etc. (this can be done using postgres' functions as default values, or trigger-based, if your data is modified). Then create GIN indexes on those fields -> and then construct your corresponding full text queries on those fieldsKeelson
Thanks @IlyaDyoshin . I like your idea - will try to experiment with it.Nonconcurrence
or you can wait till 10.0 release - where json/jsonb FTS will be first class citizen postgresql.org/docs/10/static/release-10.htmlKeelson
wow thanks for info @IlyaDyoshin :beers:Nonconcurrence
Guys, see @Nick 's approach below. I tested it - it works brilliant + 1Nonconcurrence
J
18

Let's add new column of tsvector type:

alter table sites add column tsvector tsvector;

Now let's create a trigger that will collect lexems, organize them and put to our tsvector. We will use 4 groups (A, B, C, D) -- this a special tsvector's feature that allows you to distinguish lexems later, at search time (see examples in manual https://www.postgresql.org/docs/current/static/textsearch-controls.html; unfortunately, this feature support only up to 4 groups becase developers reserved only 2 bits for that, but we are lucky here, we need only 4 groups):

create or replace function t_sites_tsvector() returns trigger as $$
declare
  dic regconfig;
  part_a text;
  part_b text;
  part_c text;
  part_d text;
begin
  dic := 'simple'; -- change if you need more advanced word processing (stemming, etc)

  part_a := coalesce(new.doc->>'identification', '') || ' ' || coalesce(new.doc->>'title', '') || ' ' || coalesce(new.doc->>'address', '');

  select into part_b string_agg(coalesce(a, ''), ' ') || ' ' || string_agg(coalesce(b, ''), ' ')
  from (
    select 
      jsonb_array_elements((new.doc->'buildings'))->>'identification',
      jsonb_array_elements((new.doc->'buildings'))->>'name'
  ) _(a, b);

  select into part_c string_agg(coalesce(c, ''), ' ')
  from (
    select jsonb_array_elements(b)->>'identification' from (
      select jsonb_array_elements((new.doc->'buildings'))->'deposits'
    ) _(b)
  ) __(c);

  select into part_d string_agg(coalesce(d, ''), ' ')
  from (
    select jsonb_array_elements(c)->>'sample_id'
    from (
      select jsonb_array_elements(b)->'audits' from (
        select jsonb_array_elements((new.doc->'buildings'))->'deposits'
      ) _(b)
    ) __(c)
  ) ___(d);

  new.tsvector := setweight(to_tsvector(dic, part_a), 'A')
    || setweight(to_tsvector(dic, part_b), 'B')
    || setweight(to_tsvector(dic, part_c), 'C')
    || setweight(to_tsvector(dic, part_d), 'D')
  ;
  return new;
end;
$$ language plpgsql immutable;

create trigger t_sites_tsvector
  before insert or update on sites for each row execute procedure t_sites_tsvector();

^^ -- scroll it, this snippet is bigger than it looks (especially of you have MacOS w/o scrollbars...)

Now let's create GIN index to speedup search queries (makes sense if you have many rows -- say, more than hundreds or thousands):

create index i_sites_fulltext on sites using gin(tsvector);

And now we insert something to check:

insert into sites select 1, '{
      "_id": "123",
      "type": "Site",
      "identification": "Custom ID",
      "title": "SITE 1",
      "address": "UK, London, Mr Tom''s street, 2",
      "buildings": [
          {
               "uuid": "12312",
               "identification": "Custom ID",
               "name": "BUILDING 1",
               "deposits": [
                   {
                      "uuid": "12312",
                      "identification": "Custom ID",
                      "audits": [
                          {
                             "uuid": "12312",
                              "sample_id": "SAMPLE ID"
                          }
                       ]
                   }
               ]
          }
       ]
    }'::jsonb;

Check with select * from sites; – you must see that tsvector column is filled with some data.

Now let's query it:

select * from sites where tsvector @@ to_tsquery('simple', 'sample');

-- it must return our record. In this case, we search for 'sample' word and we don't care in which group it will be found.

Let's change it and try to search only in group A ("SITE (identification, title, address)" as you described it):

select * from sites where tsvector @@ to_tsquery('simple', 'sample:A');

-- this must return nothing because word 'sample' sits only in group D ("AUDIT (sample_id)"). Indeed:

select * from sites where tsvector @@ to_tsquery('simple', 'sample:D');

-- will again return us our record.

Notice, that you need to use to_tsquery(..), not plainto_tsquery(..) to be able to address 4 groups. So you need to sanitize your input yourself (avoid using or remove special characters like & and | because they have special meaning in tsquery values).

And the good news is that you can combine different groups in a single query, like this:

select * from sites where tsvector @@ to_tsquery('simple', 'sample:D & london:A');

The other way to go (e.g. if you have to work with more than 4 groups) is having multiple tsvectors, each one sitting in a separate column, build them using single query, create index (you can create single index on multiple tsvector columns) and query addressing separate columns. It's similar to what I explained above, but perhaps less efficient.

Hope this helps.

Jerol answered 22/8, 2017 at 3:9 Comment(2)
Thank you so much @Jerol . Will take a look into your suggestion soon.Nonconcurrence
Hey @Jerol I forgot to say you big thanks ) I tested you approach and it works brilliant! Thanks you so much friendNonconcurrence
H
1

Things seem a little simpler in Postgres 10, as the to_tsvector function supports json. So for example this works nicely:

UPDATE dataset SET search_vector = to_tsvector('english',
'{
  "abstract":"Abstract goes here",
  "useConstraints":"None",
  "dataQuality":"Good",
  "Keyword":"historic",
  "topicCategory":"Environment",
  "responsibleOrganisation":"HES"
}'::json)
where dataset_id = 4;

Note I haven't tried this on a deeply nested structure, but don't see why it wouldn't work

Hallette answered 6/9, 2019 at 13:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.