Is `ts_headline` intended to highlight non-matching parts of the query (which it does)?
Asked Answered
C

1

6

I'm running some complex FTS queries in Postgres 13.4 and discovered some behavior in ts_headline that's unexpected, at least to me, and wasn't sure if I've come across a feature or a bug. ;)

Initial sanity check:

SELECT plainto_tsquery('english', 'red dog') @@ to_tsvector('The quick brown fox jumped over the lazy dog.');

-- false

No surprises: the tsquery evaluates to 'red' & 'dog', the document does not contain 'red', no match. But when I try to get headlines:

SELECT ts_headline('The quick brown fox jumped over the lazy dog.', plainto_tsquery('english', 'red dog'));

-- The quick brown fox jumped over the lazy <b>dog</b>.

The same happens with the FOLLOWED_BY operator (<->); you can replace plainto_tsquery with phraseto_tsquery (or construct your own tsquery literals). It still highlights fragments that aren't actually matches.

The problem is not (or at the very least, not entirely) a result of trying to call ts_headline without having a real match. My original situation was actually more like:

SELECT ts_headline('I want a red dog, but not a black dog.  No red cats, either.', phraseto_tsquery('english', 'red dog'));

-- I want a <b>red</b> <b>dog</b>, but not a black <b>dog</b>.  No <b>red</b> cats, either.

In that case, given that the tsquery evaluates to 'red' <-> 'dog' (i.e., "red" immediately followed by "dog"), I'm surprised by the last two highlights.

The documentation for ts_headline says:

Displays, in an abbreviated form, the match(es) for the query in the document,

which leads me to believe this is a bug, but the longer blurb on highlighting results only says the function

returns an excerpt from the document in which terms from the query are highlighted

and indeed, terms from the query are highlighted...

I've tried messing with the options argument to ts_headline, and nothing has changed this behavior.

So... Am I calling it wrong, is it wrong, or do I just have wrong expectations for its behavior?

(Seems tangential to this older question, but it seems to be describing a different bug. And I can't tell what exactly is or isn't happening or supposed to happen in this scenario, so who knows if it's related or not.)

Circumrotate answered 10/10, 2021 at 4:53 Comment(2)
Not an answer, but there's an open ticket from 2018 that describes this, duplicated from this Database Administrators SE question. It appears to just be a low priority bug that no one's tried to fix yet.Fathomless
have you tried setting the dictionary for ts_headline, i.e. ts_headline('english', 'The quick brown fox jumped over the lazy dog.') ?Poliomyelitis
H
0

Yes, ts_headline is intended to highlight non-matching parts of the query. This is not a bug and it works as advertised in the documentation:

the function will use the query to select relevant text fragments, and then highlight all words that appear in the query, even if those word positions do not match the query's restrictions

To see it better, have a look at the following query:

SELECT ts_headline('The quick brown fox jumped over the lazy dog.',
        plainto_tsquery('english', 'red dog')) AS my_headline,
  'red dog' as my_query,
  'The quick brown fox jumped over the lazy dog.'
       @@ plainto_tsquery('english', 'red dog') AS is_there_a_match
UNION ALL
SELECT ts_headline('The quick brown fox jumped over the lazy dog.',
       plainto_tsquery('english', 'red dog')) AS my_headline,
   'dog' AS my_query,
   'The quick brown fox jumped over the lazy dog.'
        @@ plainto_tsquery('english', 'dog') AS is_there_a_match
  ;

Its result is:

my_headline my_query is_there_a_match
The quick brown fox jumped over the lazy dog. red dog false
The quick brown fox jumped over the lazy dog. dog true

(Stackoverflow takes the <b>dog</b> from PostgreSQL's results and turns it bold here.)

If you want to have only records that match the query, you need a WHERE condition. The following returns no record:

SELECT ts_headline('The quick brown fox jumped over the lazy dog.',
        plainto_tsquery('english', 'red dog')) AS my_headline,
  'red dog' as my_query,
  'The quick brown fox jumped over the lazy dog.'
       @@ plainto_tsquery('english', 'red dog') AS is_there_a_match
  WHERE 'The quick brown fox jumped over the lazy dog.'
       @@ plainto_tsquery('english', 'red dog');

while this shows one record:

SELECT ts_headline('The quick brown fox jumped over the lazy dog.',
       plainto_tsquery('english', 'red dog')) AS my_headline,
   'dog' AS my_query,
   'The quick brown fox jumped over the lazy dog.'
        @@ plainto_tsquery('english', 'dog') AS is_there_a_match
  WHERE 'The quick brown fox jumped over the lazy dog.'
       @@ plainto_tsquery('english', 'dog')
  ;
my_headline my_query is_there_a_match
The quick brown fox jumped over the lazy dog. dog true

If you still want to have all records displayed, even those not matching, by not using WHERE, then you can use a sequence like:

CASE WHEN my_document @@ plainto_tsquery('english', 'red dog')
       THEN ts_headline(my_document, plainto_tsquery('english', 'red dog'))
     ELSE my_document
END AS my_headline

A related self-contained query (that you can test without having a table) is:

SELECT CASE WHEN 'some red dogs' @@ plainto_tsquery('english', 'red dog')
       THEN ts_headline('some red dogs', plainto_tsquery('english', 'red dog'))
     ELSE 'some red dogs'
END AS my_headline
UNION ALL
SELECT CASE WHEN 'some dogs' @@ plainto_tsquery('english', 'red dog')
       THEN ts_headline('some dogs', plainto_tsquery('english', 'red dog'))
     ELSE 'some dogs'
END AS my_headline
;

which returns what you'd expect (based on your question):

my_headline
some red dogs
some dogs
Hoxie answered 19/10 at 8:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.