Postgres Full-Text Search with Hyphen and Numerals
Asked Answered
E

2

5

I have observed what seems to me an odd behavior of Postgres' to_tsvector function.

SELECT to_tsvector('english', 'abc-xyz');

returns

'abc':2 'abc-xyz':1 'xyz':3

However,

SELECT to_tsvector('english', 'abc-001');

returns

'-001':2 'abc':1

Why not something like this?

'abc':2 'abc-001':1 '001':3

And what should I do to be able to search by the numeric portion alone, without the hyphen?

Earthenware answered 4/9, 2019 at 19:58 Comment(0)
F
3

This can be circumvented with PG13's dict-int addon's absval option. See the official documentation.

But in case you're stuck with an earlier PG version, here's the generalized version of a "number or negative number" workaround in a query.

select regexp_replace($$'test' & '1':* & '2'$$::tsquery::text,
            '''([.\d]+''(:\*)?)', '(''\1 | ''-\1)', 'g')::tsquery;

This results in:

'test' & ( '1':* | '-1':* ) & ( '2' | '-2' )

It replaces lexemes that look like positive numbers with "number or negative number" kind of subqueries.
The double cast ::tsquery::text is just there to show how you would pass a tsquery casted to text.
Note that it handles prefix matching numeric lexemes as well.

Friede answered 5/2, 2021 at 10:30 Comment(0)
D
9

Seems the text search parser identifies the hyphen followed by digits to be the sign of a signed integer. Debug with ts_debug():

SELECT * FROM ts_debug('english', 'abc-001');

   alias   |   description   | token | dictionaries | dictionary | lexemes 
-----------+-----------------+-------+--------------+------------+---------
 asciiword | Word, all ASCII | abc   | {simple}     | simple     | {abc}
 int       | Signed integer  | -001  | {simple}     | simple     | {-001}

Other text search configurations (like 'simple' instead of 'english') won't help as the parser itself is "at fault" here (debatable).

A simple way around it (other than modifying the parser, which I never tried) would to pre-process strings and replace hyphens with m-dash () or just blanks to make sure those are identified as "Space symbols". (Actual signed integers lose their negative sign in the process.)

SELECT to_tsvector('english', translate('abc-001', '-', '—'))
    @@ to_tsquery ('english', '001');  -- true now

db<>fiddle here

Demonetize answered 4/9, 2019 at 22:57 Comment(1)
Thanks! Calling "translate" won't work for me, since it won't work for other use cases not mentioned here, but this would: WHERE to_tsvector('english', field)) @@ to_tsquery ('english', '001') OR to_tsvector('english', field)) @@ to_tsquery ('english', '-001');Earthenware
F
3

This can be circumvented with PG13's dict-int addon's absval option. See the official documentation.

But in case you're stuck with an earlier PG version, here's the generalized version of a "number or negative number" workaround in a query.

select regexp_replace($$'test' & '1':* & '2'$$::tsquery::text,
            '''([.\d]+''(:\*)?)', '(''\1 | ''-\1)', 'g')::tsquery;

This results in:

'test' & ( '1':* | '-1':* ) & ( '2' | '-2' )

It replaces lexemes that look like positive numbers with "number or negative number" kind of subqueries.
The double cast ::tsquery::text is just there to show how you would pass a tsquery casted to text.
Note that it handles prefix matching numeric lexemes as well.

Friede answered 5/2, 2021 at 10:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.