Building a full text search engine: where to start [closed]
Asked Answered
J

13

13

I want to write a web application using Google App Engine (so the reference language would be Python). My application needs a simple search engine, so the users would be able to find data specifying keywords.

For example, if I have one table with those rows:

1 Office space
2 2001: A space odyssey
3 Brazil

and the user queries for "space", rows 1 and 2 would be returned. If the user queries for "office space", the result should be rows 1 and 2 too (row 1 first).

What are the technical guidelines/algorithms to do this in a simple way?
Can you give me good pointers to the theory behind this?

Thanks.

Edit: I'm not looking for anything complex here (say, indexing tons of data).

Jarrettjarrid answered 6/10, 2008 at 21:10 Comment(0)
S
4

I would not build it yourself, if possible.

App Engine includes the basics of a Full Text searching engine, and there is a great blog post here that describes how to use it.

There is also a feature request in the bug tracker that seems to be getting some attention lately, so you may want to hold out, if you can, until that is implemented.

Sunwise answered 7/10, 2008 at 2:30 Comment(1)
I've tried it, and it works, even if has some glitches; for example, it doesn't seem to support queries shorter than 3 chars. Anyway, a good compromise.Jarrettjarrid
F
8

Read Tim Bray's series of posts on the subject.

  • Background
  • Usage of search engines
  • Basics
  • Precision and recall
  • Search engne intelligence
  • Tricky search terms
  • Stopwords
  • Metadata
  • Internationalization
  • Ranking results
  • XML
  • Robots
  • Requirements list
Friesian answered 7/10, 2008 at 2:6 Comment(0)
D
7

I found these two books very useful when I used to build full-text search engines.

Information Retrieval

Managing Gigabytes

Ducan answered 6/10, 2008 at 21:14 Comment(1)
I haven't read/seen Information Retrieval, but I have used Managing Gigabytes, and quite aside from covering basically everything that is necessary to make a full text search and retrieval system it is actually very readable.Whit
S
4

I would not build it yourself, if possible.

App Engine includes the basics of a Full Text searching engine, and there is a great blog post here that describes how to use it.

There is also a feature request in the bug tracker that seems to be getting some attention lately, so you may want to hold out, if you can, until that is implemented.

Sunwise answered 7/10, 2008 at 2:30 Comment(1)
I've tried it, and it works, even if has some glitches; for example, it doesn't seem to support queries shorter than 3 chars. Anyway, a good compromise.Jarrettjarrid
E
3

Lucene or Autonomy! These are not out of the box solutions for you. You will have to write wrappers on top of their interfaces.
They certainly do take care of the stemming, grammar , relational operators etc

Extravagancy answered 6/10, 2008 at 21:10 Comment(0)
V
3

As always start in wikipedia. First start is usually building an inverted index.

Vanderbilt answered 6/10, 2008 at 21:21 Comment(0)
L
3

Here's an original idea:

Don't build an index. Seriously.

I was faced with a similar progblem some time ago. I needed a fast method to search megs and megs of text that came from documentation. I needed to match not just words, but word proximity in large documents (is this word near that word). I just ended up writing it in C, and the speed of it surprised me. It was fast enough that it didn't need any optimizing or indexing.

With the speed of today's computers, if you write code that runs straight on the metal (compiled code), you often don't need an order log(n) type algorithm to get the performance you need.

Lunate answered 6/10, 2008 at 22:58 Comment(1)
This appeals to my don't over optimize sensibilities. Up modded.Skip
I
1

First build your index. Go through the input, split into words
For each word check if it is already in the index, if it is add the current record number to the index list, if not add the word and record number.
To look up a word go to the (possibly sorted) index and return all the record numbers for that word.
It's very esy to do this for a reasoable size list using Python's builtin storage types.

As an extra refinement you only want to store the base part of a word, eg 'find' for 'finding' - look up stemming algorithms.

Ionone answered 6/10, 2008 at 21:59 Comment(3)
If by "index" you mean "mapping" or "dictionary", then this is the job. If you do mean that, then take out the word "sorted" since the Python dictionary is a hash from key to value.Hawthorn
I meant index in the general (book/DB) sense. What container in python is best to implement it is a separate question.Ionone
Other than a Python dictionary, there aren't any other choices. Hence the suggestion that you consider dropping "sorted".Hawthorn
B
1

The book Introduction to Information Retrieval provides a good introduction to the field.

A dead-tree version is published by Cambridge University Press, but you can also find a free online edition (in HTML and PDF) following the link above.

Bobsledding answered 8/10, 2008 at 1:37 Comment(0)
S
0

See also a question I asked: How-to: Ranking Search Results.

Surely there are more approaches, but this is the one I'm using for now.

Semanteme answered 6/10, 2008 at 21:23 Comment(0)
P
0

Honestly, smarter people than I have figured this stuff out. I'd load up the solr app and make json calls from my appengine app and let solr take care of indexing.

Perrone answered 6/10, 2008 at 22:49 Comment(0)
S
0

I just found this article this weekend: http://www.perl.com/pub/a/2003/02/19/engine.html

Looks not too complicated to do a simple one (though it would need heavy optimizing to be an enterprise type solution for sure). I plan on trying a proof of concept with some data from Project Gutenberg.

If you're just looking for something you can explore and learn from I think this is a good start.

Skip answered 7/10, 2008 at 2:40 Comment(0)
W
0

Look into the book "Managing Gigabytes" it covers storage and retrieval of huge amounts of plain text data -- eg. both compression and actual searching, and a variety of the algorithms that can be used for each.

Also for plain text retrieval you're best off using a vector based search system rather than a keyword->document indexing system as vector based systems can be much faster, and, more importantly can provide relevancy ranking relatively trivially.

Whit answered 8/10, 2008 at 2:26 Comment(0)
C
-1

Try this: Say the variable table is your list of search entries.

query = input("Query: ").strip().lower()#Or raw_input, for python 2
end = []
for item in table:
    if query in item.strip().lower():
        end.append(item)

print end #Narrowed results

It just iterates through all of the items to see if the query is in any one of them. It works for a simple in-app search function. Maybe not for the whole internet, though.

Cangue answered 4/1, 2017 at 21:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.