PostgreSQL: How to go around ts_vector size limitations?
Asked Answered
P

1

12

I'm creating a search inside a Rails app using the pg_search gem. However, one of the tables have a Text datatype field that it's content happens to be a little larger than usual.

Now when I need to setup a tsvector column for the text columns, I face some limitations that due the the text field size vs tsvector size.

ERROR: string is too long for tsvector (5068741 bytes, max 1048575 bytes)

Is there any way that I determine condition to skip bigger Text fields while creating the tsvector column in the SQL trigger to do something like this:

pseudocode:

execute(<<-TRIGGERSQL)
CREATE OR REPLACE FUNCTION public.essays_before_insert_update_row_tr()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
BEGIN
    If (SELECT LEN(body_text) FROM essays) <= 1048575
      new.tsv_body_text := to_tsvector('pg_catalog.english', coalesce(new.body_text,''));
      RETURN NEW;
    End
END;
$function$
  TRIGGERSQL

  # no candidate create_trigger statement could be found, creating an adapter-specific one
  execute("CREATE TRIGGER essays_before_insert_update_row_tr BEFORE INSERT OR UPDATE ON \"essays\" FOR EACH ROW EXECUTE PROCEDURE essays_before_insert_update_row_tr()")

related question that I found without an answer:

Postgresql - converting text to ts_vector

Plantar answered 26/5, 2015 at 22:38 Comment(0)
T
8

A simple workaround is to just invoke to_tsvector() with a truncated text value. For example, using the trigger example from the Postgres manual as starting point this approach looks like this:

CREATE FUNCTION essays_tsv_trigger_fn() RETURNS trigger AS $$
begin
    new.tsv_body_text := to_tsvector('english', left(new.body_text, 4*1024*1024));
    return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER essays_tsv_trigger BEFORE INSERT OR UPDATE
    ON essays FOR EACH ROW EXECUTE FUNCTION essays_tsv_trigger_fn();

This truncates the document's content to 4 MiB which should be useful enough for many document collections. Instead of just ignoring 'overly' long documents you include at least parts of it. In my experience, 4 MiB works well for technical english documentation. Depending on the size of the actually used vocabulary you could even succeed when truncating with a larger value like 10 MiB.

If you really want to ignore too long documents you could guard the to_tsvector() assignment with an if statement like this:

CREATE FUNCTION essays_tsv_trigger_fn() RETURNS trigger AS $$
begin
    if length(new.body_text) <= 4*1024*1024 then
        new.tsv_body_text := to_tsvector('english', new.body_text);
    end if;
  return new;
end
$$ LANGUAGE plpgsql;
Telluric answered 10/8, 2019 at 15:35 Comment(2)
Sorry, but I don't get how you get the document's size to be 4MiB using the left function. Isn't that function about getting as many characters from a text as specified?Assyrian
@BurakKaymakci from the horses mouth: left() => 'Return first n characters in the string.' As I wrote, that left() expression in my example truncates the document (which is sotred in the body_text column) to 4 MiB. Sure, depending on the text encoding some characters might encode to more than 1 byte. Thus, left(..., 4*1024*1024) approximately truncates the document to its first 4 MiBs, to be extra precise.Telluric

© 2022 - 2024 — McMap. All rights reserved.