Storing an inverted index

I am working on a project on Information Retrieval. I have built a full inverted index using Hadoop/Python. Hadoop outputs the index as (word, document list) pairs, which are written to a file. For quick access, I have created a dictionary (hashtable) from that file. My question is: how do I store such an index on disk so that it also has a quick access time? At present I am storing the dictionary using the Python pickle module and loading from it, but that brings the whole index into memory at once (or does it?). Please suggest an efficient way of storing and searching through the index.

My dictionary structure is as follows (using nested dictionaries)

{word : {doc1:[locations], doc2:[locations], ....}}

so that I can get the documents containing a word by dictionary[word].keys() ... and so on.
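
For reference, a minimal sketch of what I am doing at the moment (the file name and sample data are just placeholders):

import pickle

# the nested index in the structure above: {word: {doc: [locations]}}
index = {"retrieval": {"doc1": [4, 17], "doc2": [3]},
         "hadoop": {"doc1": [9]}}

# current approach: pickle the whole dictionary to disk ...
with open("index.pkl", "wb") as f:
    pickle.dump(index, f)

# ... and load it all back into memory before answering any query
with open("index.pkl", "rb") as f:
    index = pickle.load(f)

print(index["retrieval"].keys())   # documents containing the word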

Cyclamate asked 10/9, 2010 at 19:29 Comment(1)
Sure you don't want to simply use Sphinx/Lucene/Xapian? – Seedman

Use the shelve module.

At present I am storing the dictionary using python pickle module and loading from it but it brings the whole of index into memory at once (or does it?).

Yes it does bring it all in.

Is that a problem? If it's not an actual problem, then stick with it.

If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
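
A rough sketch of how shelve might replace the pickled dictionary (file name and sample data are only illustrative); a shelf keeps its values on disk and unpickles only the entries you actually look up:

import shelve

# a tiny sample index in the question's format: {word: {doc: [locations]}}
index = {"retrieval": {"doc1": [4, 17], "doc2": [3]},
         "hadoop": {"doc1": [9]}}

db = shelve.open("index_shelf")       # creates the backing file(s) on disk
for word, postings in index.items():
    db[word] = postings               # each value is pickled individually
db.close()

# later: open the shelf and fetch only the words you need
db = shelve.open("index_shelf")
postings = db.get("retrieval", {})    # {} if the word is not indexed
print(postings.keys())                # documents containing "retrieval"
db.close()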

Mucoid answered 10/9, 2010 at 19:45 Comment(4)
Thank you all for your kind support ... I am supposed to build a text search engine that uses an inverted index. So far, I have built a full inverted index. So, the problem I think I may have is that as the index size grows, bringing it all in would probably consume too much memory (???) ... at present I am only working on a prototype with reduced functionality, so the index size is trivial ... but when it's finished, it would probably be a large file ... that's the problem that I have. – Cyclamate
@Siddharth Sharma: "consume too much memory(???)". If you don't know, then don't start by trying to optimize it. First -- build using a simple dictionary until you can prove that it uses too much memory. Then -- and only after you have proof -- switch to shelve. Later, you'll have to switch to a bloom filter, but only after you can prove that shelve is too slow. – Mucoid
@S.Lott: thanks ... yeah, Knuth said "premature optimization is the root of all evil" ... will try to keep it in mind. – Cyclamate
@Siddharth Sharma: Keep your goals in mind. (1) Works. (2) Optimal use of resources. You can't do #2 until you've done #1. – Mucoid

I would use Lucene. Why reinvent the wheel?

Helvetii answered 14/9, 2010 at 3:24 Comment(1)
Because it sounds like an academic learning project. I came here looking for something similar, for similar purposes. – Gus

Just store it in a string like this:

<entry1>,<entry2>,<entry3>,...,<entryN>

If <entry*> contains the ',' character, use some other delimiter, such as '\t'. This is smaller than an equivalent pickled string.

If you want to load it, just do:

L = s.split(delimiter)
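
A rough sketch of what that could look like for one word's postings (the delimiters and layout here are just one possible choice, not something prescribed by the answer):

# serialize {doc: [locations]} into one line, e.g. "doc1:4,17\tdoc2:3"
# '\t' separates documents, ',' separates locations within a document
postings = {"doc1": [4, 17], "doc2": [3]}
s = "\t".join("%s:%s" % (doc, ",".join(map(str, locs)))
              for doc, locs in postings.items())

# load it back by splitting on the same delimiters
loaded = {}
for entry in s.split("\t"):
    doc, locs = entry.split(":")
    loaded[doc] = [int(x) for x in locs.split(",")]

print(loaded)   # {'doc1': [4, 17], 'doc2': [3]}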
Macias answered 10/9, 2010 at 21:1 Comment(0)

You could store the repr() of the dictionary and use that to re-create it.
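
The answer does not say how to re-create it; for a plain dictionary of strings, numbers, and lists, ast.literal_eval is one safe way to do that (safer than eval), roughly:

import ast

index = {"retrieval": {"doc1": [4, 17], "doc2": [3]}}

# write the textual representation of the dictionary to disk
with open("index.repr", "w") as f:
    f.write(repr(index))

# read it back and rebuild the dictionary from the text
with open("index.repr") as f:
    restored = ast.literal_eval(f.read())

print(restored == index)   # True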

Treasonous answered 10/9, 2010 at 21:40 Comment(2)
That'll be space-inefficient. My solution takes less space. – Macias
Why does space matter? What problem does the original question actually have? Time? Space? Licensing fees for third-party software? There's no hint as to what to optimize. – Mucoid

If it's taking a long time to load or using too much memory, you might need a database. There are many you might use; I would probably start with SQLite. Then your problem is "reduced" ;-) to simply formulating the right query to get what you need out of the database. This way you will only load what you need.
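
A minimal sketch with Python's built-in sqlite3 module (the table layout, file name, and sample rows are only illustrative):

import sqlite3

conn = sqlite3.connect("index.db")
conn.execute("CREATE TABLE IF NOT EXISTS postings "
             "(word TEXT, doc TEXT, location INTEGER)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON postings(word)")

# a few sample postings: (word, document, position)
rows = [("retrieval", "doc1", 4), ("retrieval", "doc1", 17),
        ("retrieval", "doc2", 3), ("hadoop", "doc1", 9)]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
conn.commit()

# load only what you need: the documents containing a given word
docs = [row[0] for row in conn.execute(
    "SELECT DISTINCT doc FROM postings WHERE word = ?", ("retrieval",))]
print(docs)    # ['doc1', 'doc2']
conn.close()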

Glint answered 10/9, 2010 at 22:36 Comment(0)

I am using anydbm for that purpose. Anydbm provides the same dictionary-like interface, except it allows only strings as keys and values. But this is not a real constraint, since you can use cPickle's loads/dumps to store more complex structures in the index.
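
A rough sketch of that combination, using the Python 2 module names from the answer (anydbm is dbm and cPickle is pickle in Python 3; the file name is just an example):

import anydbm
import cPickle as pickle

db = anydbm.open("index.dbm", "c")      # "c": create the file if needed

# keys and values must be strings, so pickle each word's postings dict
postings = {"doc1": [4, 17], "doc2": [3]}
db["retrieval"] = pickle.dumps(postings)

# look a word up and unpickle its postings
loaded = pickle.loads(db["retrieval"])
print(loaded.keys())                    # documents containing "retrieval"
db.close()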

Baud answered 17/3, 2011 at 15:36 Comment(0)
