Word separators for Postgres full text search with Rails
Asked Answered
Q

3

12

I'm using pg_search for some text searching within my model. Among other attributes, I have an url field.

Unfortuantelly Postgres doesn't seem to identify / and . as word separators, therefore I cannot search within the url.

Example: searching for test in http://test.com yields no results.

Is there a way to fix this problem, perhaps using another gem or some inline SQL ?

Quinby answered 30/12, 2012 at 18:11 Comment(1)
as a work-around you could have another column/attribute with the url and recognized word separators, and search against that insteadYe
Q
1

I ended up modifying the pg_search gem to support arbitrary ts_vector expressions instead of just column names. The changes are here

Now I can write:

pg_search_scope :search, 
    against: [[:title , 'B'], ["to_tsvector(regexp_replace(url, '[^\\w]+', ' ', 'gi'))", 'A']],
    using: {tsearch: {dictionary: "simple"}}
Quinby answered 6/1, 2013 at 10:44 Comment(0)
G
5

As stated in the documentation (and noticed by AJcodez), there is a solution in creating a dedicated column for tsvector index. Then define a trigger that catches insertions to index urls properly:

CREATE test_url (url varchar NOT NULL, url_tsvector tsvector NOT NULL);

This method will transorm any non alpha characters into single space and turn the string into a tsvector:

CREATE OR REPLACE FUNCTION generate_url_tsvector(varchar) 
RETURNS tsvector 
LANGUAGE sql 
AS $_$
    SELECT to_tsvector(regexp_replace($1, '[^\w]+', ' ', 'gi'));
$_$;

Now create a trigger that calls this function:

CREATE OR REPLACE FUNCTION before_insert_test_url()
RETURNS TRIGGER
LANGUAGE plpgsql AS $_$
BEGIN;
  NEW.url_tsvector := generate_url_tsvector(NEW.url); 

  RETURN NEW;
END;
$_$
;

CREATE TRIGGER before_insert_test_url_trig 
BEFORE INSERT ON test_url 
FOR EACH ROW EXECUTE PROCEDURE before_insert_test_url();

Now, when url are inserted, the `url_tsvectorè field will be automatically populated.

INSERT INTO test_url (url) VALUES ('http://www.google.fr');
TABLE test_url;

 id          url                     url_tsvector            

  2  http://www.google.fr  'fr':4 'googl':3 'http':1 'www':2 

(1 row)

To FT search on URLs you only need to query against this field.

SELECT * FROM test_url WHERE url_tsvector @@ 'google'::tsquery;
Gawky answered 2/1, 2013 at 10:53 Comment(2)
You don't even need the extra column if you use a functional index. Further, you can create a view to avoid the need to specify the full form of the expression used to create the index when performing a query.Elizabethelizabethan
I hacked that gem to support arbitrary ts_vector calls, which will allow me to use regexp_replace, as you suggested. Using insert/update triggers might have the benefit of performance, but I find this solution a bit too cumbersome for my needs. Thanks for the help.Quinby
Q
1

I ended up modifying the pg_search gem to support arbitrary ts_vector expressions instead of just column names. The changes are here

Now I can write:

pg_search_scope :search, 
    against: [[:title , 'B'], ["to_tsvector(regexp_replace(url, '[^\\w]+', ' ', 'gi'))", 'A']],
    using: {tsearch: {dictionary: "simple"}}
Quinby answered 6/1, 2013 at 10:44 Comment(0)
N
0

Slightly simpler approach, add the protocol token type to the simple dictionary:

ALTER TEXT SEARCH CONFIGURATION simple
    ADD MAPPING FOR protocol
        WITH simple;

you can also add it to the english dictionary if you need stemming

https://www.postgresql.org/docs/13/textsearch-parsers.html

https://www.postgresql.org/docs/13/sql-altertsconfig.html

Nix answered 5/1, 2022 at 19:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.