Fulltext Search with InnoDB
Asked Answered
W

9

94

I'm developing a high-volume web application, where part of it is a MySQL database of discussion posts that will need to grow to 20M+ rows, smoothly.

I was originally planning on using MyISAM for the tables (for the built-in fulltext search capabilities), but the thought of the entire table being locked due to a single write operation makes me shutter. Row-level locks make so much more sense (not to mention InnoDB's other speed advantages when dealing with huge tables). So, for this reason, I'm pretty determined to use InnoDB.

The problem is... InnoDB doesn't have built-in fulltext search capabilities.

Should I go with a third-party search system? Like Lucene(c++) / Sphinx? Do any of you database ninjas have any suggestions/guidance? LinkedIn's zoie (based off Lucene) looks like the best option at the moment... having been built around realtime capabilities (which is pretty critical for my application.) I'm a little hesitant to commit yet without some insight...

(FYI: going to be on EC2 with high-memory rigs, using PHP to serve the frontend)

Woodrum answered 4/9, 2009 at 19:42 Comment(1)
InnoDB now supports full text search as of 5.6Slowly
K
50

I can vouch for MyISAM fulltext being a bad option - even leaving aside the various problems with MyISAM tables in general, I've seen the fulltext stuff go off the rails and start corrupting itself and crashing MySQL regularly.

A dedicated search engine is definitely going to be the most flexible option here - store the post data in MySQL/innodb, and then export the text to your search engine. You can set up a periodic full index build/publish pretty easily, and add real-time index updates if you feel the need and want to spend the time.

Lucene and Sphinx are good options, as is Xapian, which is nice and lightweight. If you go the Lucene route don't assume that Clucene will better, even if you'd prefer not to wrestle with Java, although I'm not really qualified to discuss the pros and cons of either.

Kavanaugh answered 4/9, 2009 at 22:50 Comment(6)
Solr (based on Lucene) can scale hugely and its very powerful and flexible. We have employed Solr (specifically the LucidWorks for Solr edition) and I can say its been a huge win. Sphinx has some serious promise as well but ultimately its lack of datatypes can be troubling, for our application at least. Sphinx is very fast and if it fits your needs is a solid choice as well.Costume
Thanks a bunch you two; great responses. I've been thumbing through Solr's docs, and, that seems like a great solution to go with. It powers quite a few huge websites too, I see. I think Solr's the ticket. Thanks guys. Also, it's good to learn of your MyISAM headaches, Ian... those will be good to have in mind in the future. On other projects, I'll stray away from ever trying to use the fulltext feature.Woodrum
Was wondering what made Ian say "don't assume that Clucene will better"? as one of the clucene core team I might not be so objective, but to me it seems optimized C++ port of any Java library will boost it's performance up through the roof. I'd recommend anyone not to post such comments without having at least a glance at the product they are dishonoring.Percutaneous
When you slam MyISAM, you really need to be more specific. "Off the rails" is very vague, and may have been because a single bug in the build you were using, possibly since fixed.Slowly
But what if you don't have the option to install software on the server - what alternatives do exist in this case?Icecold
@synhershko: Common knowledge among professionals is that porting a Java application to C++ will usually not "boost its performance through the roof;" a direct translation, using idomatic C++, will typically give a very modest boost (around 30-50%)Osculum
P
59

Along with the general phasing out of MyISAM, InnoDB full-text search (FTS) is finally available in MySQL 5.6.4 release.

Lots of juicy details at https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html.

While other engines have lots of different features, this one is InnoDB, so it's native (which means there's an upgrade path), and that makes it a worthwhile option.

Phraseograph answered 22/2, 2012 at 14:49 Comment(1)
Article link is 403 forbiddenCivil
K
50

I can vouch for MyISAM fulltext being a bad option - even leaving aside the various problems with MyISAM tables in general, I've seen the fulltext stuff go off the rails and start corrupting itself and crashing MySQL regularly.

A dedicated search engine is definitely going to be the most flexible option here - store the post data in MySQL/innodb, and then export the text to your search engine. You can set up a periodic full index build/publish pretty easily, and add real-time index updates if you feel the need and want to spend the time.

Lucene and Sphinx are good options, as is Xapian, which is nice and lightweight. If you go the Lucene route don't assume that Clucene will better, even if you'd prefer not to wrestle with Java, although I'm not really qualified to discuss the pros and cons of either.

Kavanaugh answered 4/9, 2009 at 22:50 Comment(6)
Solr (based on Lucene) can scale hugely and its very powerful and flexible. We have employed Solr (specifically the LucidWorks for Solr edition) and I can say its been a huge win. Sphinx has some serious promise as well but ultimately its lack of datatypes can be troubling, for our application at least. Sphinx is very fast and if it fits your needs is a solid choice as well.Costume
Thanks a bunch you two; great responses. I've been thumbing through Solr's docs, and, that seems like a great solution to go with. It powers quite a few huge websites too, I see. I think Solr's the ticket. Thanks guys. Also, it's good to learn of your MyISAM headaches, Ian... those will be good to have in mind in the future. On other projects, I'll stray away from ever trying to use the fulltext feature.Woodrum
Was wondering what made Ian say "don't assume that Clucene will better"? as one of the clucene core team I might not be so objective, but to me it seems optimized C++ port of any Java library will boost it's performance up through the roof. I'd recommend anyone not to post such comments without having at least a glance at the product they are dishonoring.Percutaneous
When you slam MyISAM, you really need to be more specific. "Off the rails" is very vague, and may have been because a single bug in the build you were using, possibly since fixed.Slowly
But what if you don't have the option to install software on the server - what alternatives do exist in this case?Icecold
@synhershko: Common knowledge among professionals is that porting a Java application to C++ will usually not "boost its performance through the roof;" a direct translation, using idomatic C++, will typically give a very modest boost (around 30-50%)Osculum
T
11

You should spend an hour and go through installation and test-drive of Sphinx and Lucene. See if either meets your needs, with respect to data updates.

One of the things that disappointed me about Sphinx is that it doesn't support incremental inserts very well. That is, it's very expensive to reindex after an insert, so expensive that their recommended solution is to split your data into older, unchanging rows and newer, volatile rows. So every search your app does would have to search twice: once on the larger index for old rows and also on the smaller index for recent rows. If that doesn't integrate with your usage patterns, this Sphinx is not a good solution (at least not in its current implementation).

I'd like to point out another possible solution you could consider: Google Custom Search. If you can apply some SEO to your web application, then outsource the indexing and search function to Google, and embed a Google search textfield into your site. It could be the most economical and scalable way to make your site searchable.

Trishatriskelion answered 4/9, 2009 at 23:28 Comment(1)
Thanks, Bill. Yeah, the Sphinx documentation had me wavering a bit about how it handles the index updates. Good to get it confirmed. That sort of system would probably turn into a nightmare for me, I imagine. As for Google Custom Search, that's an option. However, my main problem with that is just the non-realtime index and lack of customization. Styling the results and pulling additional data will be fairly crucial to me. Thanks for chiming in though---the Sphinx info is certainly good to know!Woodrum
S
3

Perhaps you shouldn't dismiss MySQL's FT so quickly. Craigslist used to use it.

MySQL’s speed and Full Text Search has enabled craigslist to serve their users .. craigslist uses MySQL to serve approximately 50 million searches per month at a rate of up to 60 searches per second."

edit

As commented below, Craigslist seems to have switched to Sphinx some time in early 2009.

Slowly answered 15/12, 2010 at 1:53 Comment(2)
The article I linked doesn't mention Sphinx, and Nik doesn't cite any source saying Craigslist uses Sphinx at allSlowly
The case study PDF looks like from 2004, at which time there were 50M searches per month. Sphinx page states 50M searches per day, which probably explains the reason they switched to a dedicated search solution.Otha
R
1

Sphinx, as you point out, is quite nice for this stuff. All the work is in the configuration file. Make sure whatever your table is with the strings has some unique integer id key, and you should be fine.

Ratepayer answered 4/9, 2009 at 19:48 Comment(0)
J
0

try this

ROUND((LENGTH(text) - LENGTH(REPLACE(text, 'serchtext', ''))) / LENGTH('serchtext'),0)!=0
Jacquelyn answered 12/5, 2011 at 6:3 Comment(0)
N
0

You should take a look at Sphinx. It is worth a try. It's indexing is super fast and it is distributed. You should take a look at this (http://www.percona.com/webinars/2012-08-22-full-text-search-throwdown) webminar. It talks about searching and has some neat benchmarks. You may find it helpful.

Nipping answered 18/9, 2012 at 17:21 Comment(0)
A
0

If everything else fails, there's always soundex_match, which sadly isn't really fast an accurate

Agreeable answered 18/3, 2013 at 14:33 Comment(0)
F
0

For anyone stuck on an older version of MySQL / MariaDB (i.e. CentOS users) where InnoDB doesn't support Fulltext searches, my solution when using InnoDB tables was to create a separate MyISAM table for the thing I wanted to search.

For example, my main InnoDB table was products with various keys and referential integrity. I then created a simple MyISAM table called product_search containing two fields, product_id and product_name where the latter was set to a FULLTEXT index. Both fields are effectively a copy of what's in the main product table.

I then search on the MyISAM table using fulltext, and do an inner join back to the InnoDB table.

The contents of the MyISAM table can be kept up-to-date via either triggers or the application's model.

I wouldn't recommend this if you have multiple tables that require fulltext, but for a single table it seems like an adequate work around until you can upgrade.

Fitzpatrick answered 2/6, 2020 at 7:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.