Postgres Select ILIKE %text% is Slow On Large String Rows
Asked Answered
G

2

5

I have a table which have only 7 columns and one of that column stores long text data for every row. Average character length for that text column data is approximately 1500 characters. And this table got 500.000 rows.

When I use a select query and not take that text column with it, there is no problem, query takes 10 secs as expected.

But if I add this long text column to my query like Select * from table_1 it takes 3 or 4 minutes to complete this query and fill datatable with dataadapter.

And why I need to lookup all that long text column records? Because I need to use text filter on it like:

SELECT * 
FROM table_1 
WHERE longtextcolumn ILIKE ANY (ARRAY['%texttosearch1%', '%texttosearch2%'])

What should I do to speed up that progress? Table partitioning could solve this speed issue? Or should I look for indexing?

Gatefold answered 16/9, 2021 at 23:59 Comment(3)
Searching a large text field is going to be slow without an appropriate index. What index do you have on there?Sphagnum
You'd want to look into enabling full text search for longtextcolumn using tsvector column or expression index.Hypabyssal
Sorry but I don't have enough information about indexes. In this table I got a row_id column which is primary key. Does that count as an index ? :) Or should I change definitions of that large text column. Current definition settings of that columns is like that: Data type: text Storage:ExtendedGatefold
G
11

After reading all answers in this post I started to learn about indexes and especially gin indexing on Postgres. And today I overcome my problem. Before using gin indexing method my query takes 8 mins to complete. Now same query takes 50 ms. It is a dramatic performance difference and I want to explain what I did step by step for the community.

EXAMPLE

Let's assume we have a table named table_1 and this table has a column named long_text. That long_text column stores long text data like strings which have length of 1500 characters. And this table_1 has 800.000 rows.

STEP 1

Postgres has a datatype which named as ts_vector. This ts_vector data type takes your long text and calculate the single distinct words of it(language based) and index that words in it. So we need to create a ts_vector column on our table_1 and convert our long_text column and populate ts_vector data from it. This will be like:

ALTER table table_1 
ADD COLUMN long_text_tsv TS_VECTOR;

STEP 2

Populate ts_vector data from our long_text column:

UPDATE table_1 t1
SET long_text_tsv = to_tsvector('english', t1.long_text)
FROM table_1 t2;

STEP 3

Create gin index of that newly populated long_text_tsv column

CREATE INDEX tsv_index
ON table_1
USING gin(long_text_tsv);

After that you are ready to filter your long text data using that newly created and indexed tsv data

STEP 4

With tsvector data, your search query should be like this:

SELECT long_text FROM table_1
WHERE long_text_tsv @@ to_tsquery('john<->lennon&music');

ts_vector data should be searched by ts_query data type. In above query <-> means followed by, & means 'AND' operator.

With this method I explained, my search query speeds up 100 times faster than the old one.

I am not sure if I do this all in the right way but I assume I am doing ok because everything is looking fine in my program now. If something is wrong in that answer, please warn me about that.

Gatefold answered 17/9, 2021 at 21:4 Comment(1)
TS_VECTOR isn't a supported type in the versions of Postgres I've used. The type that works for me is TSVECTOR.Myotonia
P
6

postgreSQL WHERE clauses containing col LIKE '%text%' can be supported by so-called trigram indexes.

In your case this trigram index will probably help.

CREATE INDEX CONCURRENTLY table1_text
     ON table1
  USING GIN (longtextcolumn gin_trgm_ops)
INCLUDE (col1, col2, col3);

You can omit the INCLUDE clause, but if you include the columns you need from your query it will be a covering index and speed your query even more.

Notice that ILIKE runs slower than LIKE, but still uses the trigram index.

Pro tip For performance avoid SELECT * if you can. Instead name the columns you want.

This sort of GIN index is in a postgreSQL extension. You may need to load it before using it. It's possible the person who manages your server will need to load it.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
Paunchy answered 17/9, 2021 at 12:51 Comment(1)
Thanks a lot for your answer. I shared my way in above answer. Can you explain it and tell me if I am doing everything right. And can you say that the method you suggested is more accurate and faster or should I use my own method ?Gatefold

© 2022 - 2024 — McMap. All rights reserved.