Fastest full text search today?
Spoiler: this is just another Lucene vs. Sphinx vs. whatever question.
All the other threads I found were almost two years old, so I decided to start fresh.

Here are the requirements:

data size: 10 GB max
rows: close to a billion
indexing should be fast
searching should be under 0 ms [OK, joke... laugh... but keep this as low as possible]

In today's world, what should I use, and how should I go about it?

edit: I did some timing on Lucene, and indexing 1.8 GB of data took 5 minutes.
Searching is pretty fast, unless I run a wildcard query like a*, which takes 400-500 ms.
My biggest worry is indexing, which is taking a long time and a lot of resources!
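To see why a* is so much slower than an exact-term search: a wildcard query has to enumerate every matching term in the sorted term dictionary, while an exact lookup is a single probe. A minimal sketch of that prefix-range enumeration (illustrative only; real Lucene uses an FST-based term index, not a Python list):

```python
from bisect import bisect_left

# Toy term dictionary, kept sorted the way an index stores its terms.
terms = sorted(["apple", "apply", "banana", "band", "bandit", "cherry"])

def prefix_range(prefix):
    """Return every term starting with `prefix` via two binary searches."""
    lo = bisect_left(terms, prefix)
    hi = bisect_left(terms, prefix + "\uffff")  # just past the last match
    return terms[lo:hi]

print(prefix_range("a"))    # an exact lookup touches one term; "a*" must
print(prefix_range("ban"))  # visit every term in the matching range
```

The shorter the prefix, the wider the term range the query expands to, which is why single-letter wildcards are the worst case.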

Grory answered 23/2, 2011 at 13:55 Comment(1)
You only have to index new, updated, and deleted data, not always the whole collection. – Triennium

I have no experience with anything other than Lucene, but it's pretty much the default indexing solution, so I don't think you can go too wrong.

10GB is not a lot of data. You'll be able to re-index it pretty rapidly - or keep it on SSDs for extra speed. And of course keep your whole index in RAM (which Lucene supports) for super-fast lookups.
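A toy sketch of what "whole index in RAM" buys you: an inverted index that maps each term to the set of documents containing it, so a lookup is a single hash probe instead of a disk seek. (Illustrative only; in Lucene the equivalent would be a RAM-resident directory rather than a Python dict.)

```python
from collections import defaultdict

# Minimal in-memory inverted index: term -> set of matching doc ids.
index = defaultdict(set)

docs = {1: "fast full text search", 2: "lucene indexing is fast"}
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(term):
    """Exact-term lookup: one dict probe, no disk access."""
    return sorted(index.get(term.lower(), set()))

print(search("fast"))    # [1, 2]
print(search("lucene"))  # [2]
```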

Geld answered 23/2, 2011 at 14:0 Comment(4)
I'm going to keep everything in the cloud, so I don't expect SSD-like speeds there :( And by the way, I can't keep the whole dataset in RAM for the app I'm working on... it'd be like 1000 GB of unique data per computer, so everything can't be brought into memory... – Grory
OK, well, the SSDs will only make a difference when building the index. But I'm confused: you said the max data size was 10 GB, not 1000? – Geld
Lol :D true, not 1000 GB :) it's only 10 GB... check the edits now :) – Grory
Well, it's not that simple, for reasons I didn't specify in the post... there are going to be multiple indexes of 10 GB each, with multiple searchers running against each index. How does this work then? That was my point... sorry for the confusion; if it were only 10 GB, you'd be 100% right... – Grory

Please check the Lucene wiki for tips on improving Lucene indexing speed; it is quite succinct. In general, Lucene is quite fast (it is used for real-time search), and the tips will help you figure out whether you are missing something "obvious."
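Before tuning anything, it helps to measure your current throughput so you can tell whether a tip actually moved the needle. A minimal timing harness, with a stand-in function where your real Lucene (or other engine) indexing call would go:

```python
import time

def index_batch(docs):
    """Stand-in for real indexing work: here, just tokenize each document."""
    return sum(len(d.split()) for d in docs)

docs = ["some document text"] * 10_000
start = time.perf_counter()
tokens = index_batch(docs)
elapsed = time.perf_counter() - start
print(f"indexed {len(docs)} docs ({tokens} tokens) in {elapsed:.3f}s")
```

Run the same harness before and after each change (larger RAM buffer, bigger batches, fewer commits) to see which ones matter for your data.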

Purcell answered 23/2, 2011 at 17:3 Comment(4)
I've done everything "obvious" by now :) I just wanted to know if "this" IS the way to go :) And by the way, is the indexing time alright? It's 5 minutes for 1.8 GB. – Grory
Size is a somewhat inaccurate metric. Indexing 1.8 GB of plain text is different from indexing 1.8 GB of HTML (which you would parse before indexing the extracted text). You need to see whether that is "fast enough" for your needs. If the existing indexing speed falls short of your expectations, you may wish to explore how to use Lucene in a real-time environment. That is non-trivial. – Purcell
@Grory: your indexing speed is limited by how fast you can read off disk and how much that data needs to be processed before index insertion. – Geld
@Richard: agreed... there are just a few string manipulations done before inserting, and that is adding to the time too... I will try to reduce them, but I just wanted to be sure whether there is a way to speed Lucene up further... – Grory

"My biggest worry is indexing, which is taking a long time and a lot of resources!"

Take a look at LuSql. We used it once; FWIW, indexing 100 GB of data from MySQL on a decent machine took a little more than an hour, on an NTFS filesystem.

Now, if you add SSDs or some other ultra-fast disk technology, you can bring that down considerably.
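For a back-of-the-envelope comparison of the two runs reported in this thread (taking "a little more than an hour" as roughly 60 minutes):

```python
# Rough indexing throughput for each reported run, in MB/s.
lusql_mb_per_s = 100 * 1024 / (60 * 60)   # ~100 GB in ~1 hour
lucene_mb_per_s = 1.8 * 1024 / (5 * 60)   # 1.8 GB in 5 minutes

print(f"LuSql run: ~{lusql_mb_per_s:.1f} MB/s")
print(f"OP's run:  ~{lucene_mb_per_s:.1f} MB/s")
```

That works out to roughly 28 MB/s for the LuSql run versus roughly 6 MB/s for the 1.8 GB run, so there is likely room to speed things up before blaming the hardware.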

Carlist answered 28/2, 2011 at 5:32 Comment(0)
