PostgreSQL: Full Text Search - How to search partial words?
T

6

52

Following a question posted here about how I can increase the speed of one of my SQL search methods, I was advised to update my table to make use of Full Text Search. This is what I have now done, using GiST indexes to make searching faster. On some of the "plain" queries I have noticed a marked speed increase, which I am very happy about.

However, I am having difficulty in searching for partial words. For example I have several records that contain the word Squire (454) and I have several records that contain Squirrel (173). Now if I search for Squire it only returns the 454 records but I also want it to return the Squirrel records as well.

My query looks like this

SELECT title 
FROM movies 
WHERE vectors @@ to_tsquery('squire');

I thought I could do to_tsquery('squire%') but that does not work.
How do I get it to search for partial matches?

Also, in my database I have records that are movies and others that are just TV shows. These are differentiated by double quotes around the name, so "Munsters" is a TV show, whereas The Munsters is the film of the show. What I want to be able to do is search for just the TV shows, or just the movies. Any idea how I can achieve this?

Regards Anthoni

Timmons answered 25/3, 2010 at 6:32 Comment(1)
If you have the search key squire but want to get the result squirrel, you might have to specify additional constraints. Otherwise one could just as well argue that a search for mama should return rabbit. So perhaps you might want to slice your search key and turn squire into s | sq | squ | squi | squir | squire... This, or fancier algorithms, would get you the squirrel. I think @Joshua Burns's answer contains a more generic solution than mine though, if you want to be generic. - Sycamore
D
6

Even using LIKE you will not be able to get 'squirrel' from squire% because 'squirrel' has two 'r's. To get Squire and Squirrel you could run the following query:

SELECT title FROM movies WHERE vectors @@ to_tsquery('squire|squirrel');

To differentiate between movies and tv shows you should add a column to your database. However, there are many ways to skin this cat. You could use a sub-query to force postgres to first find the movies matching 'squire' and 'squirrel' and then search that subset to find titles that begin with a '"'. It is possible to create indexes for use in LIKE '"%...' searches.
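If you do add a column, its value can be derived from the existing titles. A minimal Python sketch of that classification, assuming the only signal is the leading double quote described in the question (the function name and the 'tv_show' / 'movie' labels are made up for illustration):

```python
def classify_title(title: str) -> str:
    """Label a title as a TV show if it starts with a double quote, else as a movie."""
    # Leading whitespace is ignored so that ' "Munsters"' is still treated as a TV show.
    return "tv_show" if title.lstrip().startswith('"') else "movie"


titles = ['"Munsters"', "The Munsters"]
# Pair each title with its derived label, ready to be written into the new column.
labelled = [(title, classify_title(title)) for title in titles]
```

Filtering on such a column avoids pattern matching on the title at query time.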

Without exploring other indexing possibilities you could also run these - mess around with them to find which is fastest:

SELECT title 
FROM (
   SELECT * 
   FROM movies 
   WHERE vectors @@ to_tsquery('squire|squirrel')
) t
WHERE title ILIKE '"%';

or

SELECT title 
FROM movies 
WHERE vectors @@ to_tsquery('squire|squirrel') 
  AND title ILIKE '"%';
Ditty answered 25/3, 2010 at 13:43 Comment(0)
M
100

Try,

SELECT title FROM movies WHERE to_tsvector(title) @@ to_tsquery('squire:*')

This works on PostgreSQL 8.4+
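To see what a prefix match can and cannot reach, here is a small Python sketch of the idea behind `prefix:*`. It is only an illustration: it compares raw strings and ignores the normalization (stemming, stop words) that to_tsquery applies with the configured dictionary, so real results can differ. The helper name and the word list are made up.

```python
import os


def matches_prefix(lexeme: str, prefix: str) -> bool:
    """Mimic a literal prefix test: true when the lexeme starts with the prefix."""
    return lexeme.lower().startswith(prefix.lower())


words = ["squire", "squirrel", "squid"]

# Taken literally, 'squire' is not a prefix of 'squirrel'.
literal_hits = [w for w in words if matches_prefix(w, "squire")]

# The longest prefix shared by both target words is the one that covers both.
shared = os.path.commonprefix(["squire", "squirrel"])
shared_hits = [w for w in words if matches_prefix(w, shared)]
```

Here `shared` is 'squir', which covers both squire and squirrel but not squid.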

Melodic answered 9/8, 2010 at 19:32 Comment(7)
You've specified a lexeme with prefix matching, but it won't solve the problem: it is still missing an 'r'. You should probably delete this answer. - Adaurd
@RichardMichael I disagree because this method works. The OP is trying to match two words that aren't similar: squire is not a partial of the word squirrel. He asked for a partial match and this answer does that. It should be upvoted. - Borderer
Thanks for this, helped in a use case I have. +1 - Wesla
Thanks, this solved my problem of partial matching. Where can I find the documentation that led you to append :*? - Calomel
Despite this answer being 8 years old, I'd still like to know: what if I want to search for quir? I.e., having a wildcard at both the beginning and the end of the search term. - Quietism
@Calomel I was also curious as to where the documentation is for tsquery wildcards and found it here: postgresql.org/docs/current/… - Mistakable
How can we write this in Django? I tried something like movies.objects.filter(title__search='squire:*') but it is not working. - Barratry
S
34

Anthoni,

Assuming you plan on using only ASCII encoding (could be difficult, I'm aware), a very viable option may be the Trigram (pg_trgm) module: http://www.postgresql.org/docs/9.0/interactive/pgtrgm.html

Trigram utilizes built-in indexing methods such as GiST and GIN. The only modification you have to make is when defining your index: specify an operator class of either gist_trgm_ops or gin_trgm_ops.

If the contrib modules aren't already installed, on Ubuntu it's as easy as running the following command from the shell:

sudo apt-get install postgresql-contrib

After the contrib modules are made available, you must install the pg_trgm extension into the database in question. You do this by executing the following PostgreSQL query on the database you wish to install the module into:

CREATE EXTENSION pg_trgm;

After the pg_trgm extension has been installed, we're ready to have some fun!

-- Create a test table.
CREATE TABLE test (my_column text);
-- Create a trigram index.
CREATE INDEX test_my_column_trgm_idx ON test USING gist (my_column gist_trgm_ops);
-- Add a couple of records.
INSERT INTO test (my_column) VALUES ('First Entry'), ('Second Entry'), ('Third Entry');
-- Query using our new index; % is the pg_trgm similarity operator.
SELECT my_column, similarity(my_column, 'Frist Entry') AS similarity
FROM test
WHERE my_column % 'Frist Entry'
ORDER BY similarity DESC;
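To make the similarity score less of a black box, here is a rough Python sketch of how pg_trgm scores two strings, following the behaviour described in its documentation: lower-case the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, collect the set of three-character sequences, then divide the number of shared trigrams by the number of distinct trigrams overall. The function names are mine, and the ASCII-only word pattern is a simplification; the real extension may treat non-ASCII text differently.

```python
import re


def trigrams(text: str) -> set:
    """Return the set of trigrams for `text`, padding every word as pg_trgm documents."""
    result = set()
    # Simplification: only ASCII letters and digits count as word characters here.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces in front, one space behind
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


score = similarity("Frist Entry", "First Entry")
```

With this sketch, `score` comes out at 0.5: the two strings share 8 of 16 distinct trigrams, because the transposed letters only disturb the trigrams in the middle of the first word.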
Salve answered 16/2, 2012 at 22:20 Comment(2)
The similarity in your example uses the perfect word and not the misspelled word that is used in your WHERE clause. select similarity('Frist Entry', 'First Entry') => 0.5 - Argumentation
Good point, typo on my end. Resolved. Thanks for the heads up :) - Salve
F
16

@alexander-mera's solution works great!

Note: Also make sure to convert spaces to + (or to &, the documented AND operator in tsquery syntax). For example, if you are searching for squire knight:

SELECT title FROM movies WHERE to_tsvector(title) @@ to_tsquery('squire+knight:*');
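If the search text comes from user input, the conversion can be done in application code before the string reaches to_tsquery. A minimal Python sketch, assuming terms should be joined with & (the tsquery AND operator) and that anything other than word characters can be discarded; the function name and the prefix_all flag are hypothetical:

```python
import re


def to_prefix_tsquery(user_input: str, prefix_all: bool = True) -> str:
    """Turn free text into a tsquery string that uses :* prefix matching."""
    # Keep runs of word characters only, which also drops operators such as + & | ! ( ).
    terms = re.findall(r"\w+", user_input.lower())
    if not terms:
        return ""
    if prefix_all:
        # Prefix-match every term.
        parts = [f"{term}:*" for term in terms]
    else:
        # Prefix-match only the last term, e.g. for search-as-you-type.
        parts = terms[:-1] + [f"{terms[-1]}:*"]
    return " & ".join(parts)


query = to_prefix_tsquery("squire knight")
```

Here `query` is 'squire:* & knight:*'. The resulting string is best passed to to_tsquery as a bound parameter rather than spliced into the SQL text.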
Frolic answered 20/11, 2012 at 20:48 Comment(4)
Using the '+' doesn't work for me on PostgreSQL 9.4.1. If instead I use '&', it works like a charm. - Vapory
If you double quote the tsquery to '"squire+knight":*', it expands to 'squire':* & 'knight':*. You can do the same using to_tsquery('"squire":* & "knight":*'). Depending on your approach to generating search terms, you may not want to do a wildcard prefix match on every term. - Sachsse
Absolute lifesaver with the comment about adding "+" for spaces. Exactly what my issue was. Thank you sir! - Chism
Thank you @facundofarias. That seems to be the correct answer. - Bouton
H
5

The broad solution to this is to use PG's ts_rewrite function to set up an aliases table that works for alternate matches (see Query Rewriting). This covers cases like yours above while also handling completely different cases, like searching for tree rat and getting results for squirrel.

Full details and explanation are at that link, but the gist of it is that you can set up an aliases table with two tsquery columns and pass a query on that table in with your search, like so:

CREATE TABLE aliases (t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');

Resulting in a final query that looks more like:

WHERE vectors @@ ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases')

This is similar to the thesaurus setup within PG but works without requiring a full reindex every time you add something. As you come across little spelling variations and cases of "when I search for this I expect results like this", it's very easy to just add them to the table. You can add more columns to that table as well, as long as the query passed to ts_rewrite returns the two expected tsquery columns.

When you dig into that documentation you'll see suggested examples for performance tuning as well. There's a balance between using trigram for pure speed and using vector/query/rewrite for robustness.

Hervey answered 5/7, 2016 at 16:17 Comment(0)
B
0

One thing that may work is to break the word you are searching for into smaller parts. So you could look for things that contain squi or quir or squire, etc. I'm not sure how efficient that would be, but it may help.
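As a rough illustration of that idea, the Python sketch below lists every contiguous fragment of a word above a minimum length and checks which of them occur in another word. The function name and the minimum length of 4 are arbitrary choices, and how the fragments would then be matched in the database (prefix terms, LIKE patterns, or something else) is left open.

```python
def word_fragments(word: str, min_len: int = 4) -> list:
    """Return each distinct contiguous fragment of `word` with at least `min_len` characters."""
    word = word.lower()
    fragments = []
    for size in range(min_len, len(word) + 1):
        for start in range(len(word) - size + 1):
            fragment = word[start:start + size]
            if fragment not in fragments:
                fragments.append(fragment)
    return fragments


fragments = word_fragments("squire")
# Fragments of 'squire' that also appear somewhere inside 'squirrel'.
shared = [f for f in fragments if f in "squirrel"]
```

For squire, the fragments shared with squirrel are squi, quir and squir, so any of those would connect the two words.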

When you search for the TV show or movie, you could try placing the text in single quotes, so it would be either 'show' or '"show"'. I think that could also work.

Berte answered 25/3, 2010 at 14:17 Comment(0)
