Performance of LIKE queries on multi-million row tables, MySQL
Asked Answered
S

5

26

From anybody with real experience, how do LIKE queries perform in MySQL on multi-million row tables, in terms of speed and efficiency, if the field has a plain INDEX?

Is there a better alternative (one that doesn't filter results out, like the FULLTEXT 50% rule) for performing database field searches on multi-million row tables?

EXAMPLE:

Schema (comments table)

id (PRIMARY) title(INDEX) content time stamp

Query

SELECT * FROM `comments` WHERE `title` LIKE '%query%'
Salcedo answered 10/7, 2012 at 17:44 Comment(3)
Please give an example of the LIKE clause in question and the schema.Wadley
having the wild card at the beginning makes it worse. example: name like '%jim%'Chinachinaberry
updated With example query and schemaSalcedo
L
16

From anybody with real experience, how do LIKE queries perform in MySQL on multi-million row tables, in terms of speed and efficiency, if the field has a plain INDEX?

Not so well (I think I had some searches in the range of 900k rows; I can't say I have experience with multi-million row LIKEs).

Usually you should restrict the search any way you can, but this depends on table structure and application use case.

Also, in some Web use cases it's possible to actually improve performance and user experience with some tricks, like indexing separate keywords: create a keyword table and a rows_contains_keyword (id_keyword, id_row) table. The keyword table is used with AJAX to suggest search terms (simple words) and to map them to integers -- id_keywords. At that point, finding the rows containing those keywords becomes really fast. Updating the tables one row at a time is also quite performant; batch updates, of course, become a definite "don't".
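A rough sketch of that keyword trick, using Python's sqlite3 as a stand-in for MySQL. The rows_contains_keyword table comes from the description above; the comments/keywords table and column names are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The rows we want to search, plus the two helper tables described above.
cur.executescript("""
    CREATE TABLE comments (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE rows_contains_keyword (id_keyword INTEGER, id_row INTEGER);
""")

def index_row(row_id, title):
    """Split a title into words and register each one against the row."""
    for word in set(title.lower().split()):
        cur.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        cur.execute("SELECT id FROM keywords WHERE word = ?", (word,))
        kw_id = cur.fetchone()[0]
        cur.execute(
            "INSERT INTO rows_contains_keyword (id_keyword, id_row) VALUES (?, ?)",
            (kw_id, row_id),
        )

for row_id, title in [(1, "MySQL query performance"), (2, "Indexing strategies")]:
    cur.execute("INSERT INTO comments (id, title) VALUES (?, ?)", (row_id, title))
    index_row(row_id, title)

def search(word):
    """Compile the user's word to an integer ID, then do indexed lookups."""
    cur.execute(
        """SELECT c.* FROM comments c
           JOIN rows_contains_keyword rck ON rck.id_row = c.id
           JOIN keywords k ON k.id = rck.id_keyword
           WHERE k.word = ?""",
        (word.lower(),),
    )
    return cur.fetchall()

hits = search("mysql")  # exact integer-keyed lookup instead of LIKE '%mysql%'
```

The point is that the search never touches the text column: it resolves the word to an ID once, then works only with integer equality, which any plain index handles well.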

This is not so unlike what is already done by full text MATCH..IN BOOLEAN MODE if using only the + operator:

SELECT * FROM arts WHERE MATCH (title) AGAINST ('+MySQL +RDBMS' IN BOOLEAN MODE);

You probably want an InnoDB table to do that:

Boolean full-text searches have these characteristics:

  • They do not automatically sort rows in order of decreasing relevance. ...
  • InnoDB tables require a FULLTEXT index on all columns of the MATCH() expression to perform boolean queries. Boolean queries against a MyISAM search index can work even without a FULLTEXT index, although a search executed in this fashion would be quite slow. ...
  • They do not use the 50% threshold that applies to MyISAM search indexes.

Can you give more information on the specific case?

update: the AJAX way

Setup: you break up all titles into words. This will soon give you a title_words table ( id INTEGER NOT NULL AUTO_INCREMENT, word VARCHAR(50) ) and a large title_contains_word ( word_id INTEGER, title_id INTEGER ) table.

If you have 10 million titles, with an average of four words each (plausible for books, less so for papers), you can expect a five-thousand-row title_words table and a forty-million-row table containing two INTEGER columns; that is around 400 MB of extra data.
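The back-of-the-envelope arithmetic behind that figure (assuming four-byte INT columns): the raw column data comes to about 320 MB, which lands in the quoted 400 MB ballpark once per-row and index overhead is added.

```python
titles = 10_000_000
words_per_title = 4
int_bytes = 4  # a MySQL INT column is 4 bytes

tcw_rows = titles * words_per_title   # rows in title_contains_word
raw_bytes = tcw_rows * 2 * int_bytes  # two INT columns per row
raw_mb = raw_bytes / 1_000_000        # raw data size before overhead
```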

For the search, the user starts entering a word, which you can autocomplete from title_words. Once this is done, the query becomes a list of word IDs; and of course words that aren't in any title cannot even be entered, so the negative result is given immediately, and for free.

The actual search can now happen in several ways, but one that I like has a SELECT COUNT(*) FROM title_contains_word WHERE word_id={id} running after each user's selection, before the real search is started.

This allows building a composite query or a common table expression starting from the rarest words. Indeed, if any word has a count below, say, 20, you can SELECT all those (on average) eight TCW rows and get the IDs of all their related words, then simply verify (outside MySQL) that there is a title ID such that there exists a pair (titleID, wordID) for all the wordIDs of your query.
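The "verify outside MySQL" step can be sketched with plain Python sets: fetch the small set of pairs for the rarest word, then keep only the candidate titles that carry every queried word. The pair data and IDs below are made up for illustration:

```python
# Hypothetical contents of title_contains_word, as (title_id, word_id) pairs.
tcw = {
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 40),
    (3, 10), (3, 20),
}

query_word_ids = {10, 20}  # the compiled search terms
rarest = 20                # suppose word 20 had the lowest COUNT(*)

# Rows of the rarest word give a small candidate set of titles...
candidates = {t for (t, w) in tcw if w == rarest}

# ...and a candidate only matches if it carries every queried word.
matches = sorted(
    t for t in candidates
    if all((t, w) in tcw for w in query_word_ids)
)
```

Because the candidate set is bounded by the rarest word's count (below 20 in the scenario above), this check is cheap no matter how large the full table is.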

Even if you have to resort to the roughest possible form,

SELECT tcw1.title_id
FROM title_contains_word AS tcw1
JOIN title_contains_word AS tcw2 USING (title_id)
JOIN title_contains_word AS tcw3 USING (title_id)
JOIN title_contains_word AS tcw4 USING (title_id)
...
WHERE (tcw1.word_id = {id1})
  AND (tcw2.word_id = {id2})
  ...

the JOIN will be made from very small virtually-buffered tables that will take very little time to scan.
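That self-join form can be tried out directly; here is a miniature version with sqlite3 standing in for MySQL and invented word IDs (10 and 20 playing the role of {id1} and {id2}):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE title_contains_word (word_id INTEGER, title_id INTEGER)")
cur.executemany(
    "INSERT INTO title_contains_word VALUES (?, ?)",
    [(10, 1), (20, 1), (10, 2), (30, 2), (10, 3), (20, 3)],
)

# One JOIN per query word: only titles containing BOTH word 10 and word 20 survive.
cur.execute(
    """SELECT tcw1.title_id
       FROM title_contains_word AS tcw1
       JOIN title_contains_word AS tcw2 USING (title_id)
       WHERE tcw1.word_id = 10
         AND tcw2.word_id = 20
       ORDER BY tcw1.title_id"""
)
title_ids = [row[0] for row in cur.fetchall()]
```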

Once you have all the relevant title IDs, then you can run a straight SELECT from the multimillion-row large DB using the primary key title_id. This last search should also be blazing fast.

Linnlinnaeus answered 10/7, 2012 at 17:56 Comment(3)
hmm...very interesting idea, will try it out!Salcedo
Is there a tutorial somewhere for this approach? I wish to read it in more detail, pitfalls, etc. Thanks.Langbehn
Possibly, but I don't have anything handy at the moment. I'll try and come up with something tonight. Your best bet however is to hit Google and look for some real-world experiences.Linnlinnaeus
M
18

LIKE will do a full table scan if you have a % at the start of the pattern.

You can use FULLTEXT in Boolean (rather than natural language) mode to avoid the 50% rule.

Boolean full-text searches have these characteristics:

They do not use the 50% threshold.

http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
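As an illustration of the boolean-mode idea (sqlite3 used as a stand-in, since its FTS5 engine plays a similar role; FTS5's AND operator corresponds to MySQL boolean mode's + prefix, and likewise applies no 50% threshold):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# An FTS5 virtual table indexes the text column for full-text search.
cur.execute("CREATE VIRTUAL TABLE arts USING fts5(title)")
cur.executemany(
    "INSERT INTO arts (title) VALUES (?)",
    [("MySQL RDBMS tutorial",), ("MySQL tips",), ("PostgreSQL RDBMS",)],
)

# Both terms required, like AGAINST ('+MySQL +RDBMS' IN BOOLEAN MODE).
cur.execute("SELECT title FROM arts WHERE arts MATCH 'mysql AND rdbms'")
titles = [row[0] for row in cur.fetchall()]
```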

Mesdemoiselles answered 10/7, 2012 at 17:52 Comment(1)
Very good answer and @Salcedo I would consider changing the answer to this question. FULLTEXT should be used to solve this problem.Thaddeusthaddus
S
8

I recommend restricting your query with other clauses as well (a date range, for example), because LIKE '%something' guarantees a full table scan
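A small sketch of that narrowing, with sqlite3 standing in for MySQL (the timestamp column and sample rows are invented): the indexed range condition cuts down the candidate rows before LIKE is evaluated on the survivors.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, title TEXT, ts TEXT)")
cur.execute("CREATE INDEX idx_ts ON comments (ts)")  # index the narrowing column
cur.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, "old query tips", "2010-01-01"),
        (2, "new query tips", "2012-06-01"),
        (3, "unrelated post", "2012-06-02"),
    ],
)

# The indexed date range shrinks the scan; LIKE only runs on what's left.
cur.execute(
    "SELECT id FROM comments WHERE ts >= '2012-01-01' AND title LIKE '%query%'"
)
ids = [row[0] for row in cur.fetchall()]
```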

Side answered 10/7, 2012 at 17:47 Comment(1)
in other words, LIKE '%...' and multi-million row tables don't mix in MySQL :-(Emersen
E
0

With Workbench, use EXPLAIN before your SELECT to test different uses of LIKE: with and without an INDEX, and with the wildcard in different parts of your search term. You will reach your own conclusions based on your tests, because each case is specific.
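A concrete version of that experiment, using sqlite3's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN (the table and patterns are invented; sqlite only applies its LIKE-prefix optimization in case-sensitive mode, hence the PRAGMA): a pattern with a literal prefix can use the index, while a leading wildcard forces a scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("CREATE INDEX idx_title ON t (title)")
cur.execute("PRAGMA case_sensitive_like = ON")

def plan(pattern):
    """Return the query plan text for a LIKE query with the given pattern."""
    cur.execute(f"EXPLAIN QUERY PLAN SELECT * FROM t WHERE title LIKE '{pattern}'")
    return " ".join(str(row) for row in cur.fetchall())

prefix_plan = plan("query%")     # literal prefix: index range search
wildcard_plan = plan("%query%")  # leading wildcard: full scan
```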

Eckart answered 5/12, 2019 at 10:56 Comment(0)
W
0

you could do a subselect in order to get just the most recent records.

select s.* from (select * from my_table order by `create` desc limit 10) as s
where s.event like '%status%'
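The effect can be reproduced in miniature (sqlite3 as a stand-in for MySQL; table and column names follow the answer's example, the rows are invented): LIKE only ever sees the ten newest rows produced by the inner query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute('CREATE TABLE my_table (id INTEGER PRIMARY KEY, "create" TEXT, event TEXT)')
cur.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (i, f"2021-01-{i:02d}", "status update" if i % 2 else "other")
        for i in range(1, 21)
    ],
)

# Inner query keeps only the 10 most recent rows; LIKE filters those.
cur.execute(
    '''SELECT s.id
       FROM (SELECT * FROM my_table ORDER BY "create" DESC LIMIT 10) AS s
       WHERE s.event LIKE '%status%'
       ORDER BY s.id'''
)
ids = [row[0] for row in cur.fetchall()]
```

The trade-off, of course, is that matches older than the ten most recent rows are never found, so this only fits use cases where recency bounds the search.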
Westonwestover answered 31/5, 2021 at 16:24 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.