Algorithm for search in inverted index

Consider the 10 billion words that people have searched for on Google. Corresponding to each word, you have a sorted list of all the document IDs it occurs in. The list looks like this:

[Word 1]->[doc_i1,doc_j1,.....]
[Word 2]->[doc_i2,doc_j2,.....]
...
...
...
[Word N]->[doc_in,doc_jn,.....]

I am looking for an algorithm to find 100 rare word-pairs. A rare word-pair is a pair of words that occur together (not necessarily contiguously) in exactly 1 document.

I am looking for something better than O(n^2) if possible.

Ishtar answered 5/2, 2014 at 16:43 Comment(16)
The complexity must also be a function of the number of documents. What is your complexity requirement in terms of both words and documents?Pussyfoot
Assume the maximum number of documents a word can appear in is m; then the intersection of two words' document lists (the documents in which both words are present) can be found in O(2*m) in the worst case. In this case a single rare word pair can be found in O(n^2*m) (brute force). I am looking for something better than this.Ishtar
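For illustration, the merge-style intersection mentioned above could look like the following sketch (function and variable names are mine, not taken from the thread):

    def intersect_sorted(a, b):
        # linear merge of two sorted posting lists: O(len(a) + len(b)) <= O(2*m)
        i, j, out = 0, 0, []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    # a word pair is "rare" if the two posting lists share exactly one document:
    # len(intersect_sorted(index[w1], index[w2])) == 1
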
You can easily get O(d * w^2 * m) where w is the maximum number of words per document and d is the number of documents. Is that better?Impressure
@zvisofer: You can get O(m) with high probability using hashing. And sorting all the lists is O(n log m) or even linear time using integer sorts, so it's easy to do as a preprocessing step if necessary.Impressure
@NiklasB. But for this I have to invert the whole data structure from word->doc to doc->word, which can itself take a lot of time. Also, the number of words in a document can get very large, and it would be O(d^2 * w^2). Can you please explain if I am wrong?Ishtar
@zvisofer: With high probability != expected runtimeImpressure
@Pussyfoot The document lists are sorted, as given in the question. I am looking at the worst case. @NiklasB. Can you explain how you can do it in O(m) expected time (with high probability)?Ishtar
@Microbotz: Yeah, you'd have to invert the index, but that's linear time and thus way faster than the algorithm you want to run afterwards. What I was thinking is that you only look at word pairs that actually occur together in at least one document, by enumerating all the word pairs in all documents. Then you verify whether they occur in exactly one document by checking the intersection.Impressure
@Microbotz: You can compute the intersection of two lists in O(m) whp. We were not talking about solving your problem.Impressure
Such a rare pair is traditionally called a "googlewhack". Just sayin.Skateboard
@NiklasB. There will be lots of word pairs that occur in at least one document even if you remove the stop words (is, a, the, etc.), and you still need to do it for all pairs of documents. Can you please explain how you will do this in O(d)?Ishtar
@Microbotz: You enumerate all the documents in O(d) iterations. In every document you have O(w) words, so you enumerate all the pairs in O(w^2). Now for every pair, you compute the intersection of their document lists in O(m) to check whether they occur together exactly once. The whole algorithm is O(d * w^2 * m), but in practice you might find many pairs of words multiple times; you can skip the intersection part when that happens and save the factor m in many cases. Whether this is better than the original algorithm or not depends very much on your data. I don't see where you get the O(d^2) from.Impressure
@NiklasB. Yes. We can enumerate the words in O(w^2), then build a hash with the word pair as key and increment the count for a word pair whenever it is found in another document. At the end we need to traverse the hash table and find the word pairs with just one hit (document). It can be done in O(d*w^2) plus the time for traversing the hash table. But I doubt how much improvement it will give.Ishtar
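A minimal sketch of the pair-counting idea from these comments, assuming docs maps each document ID to its set of words (all names are illustrative):

    from collections import defaultdict
    from itertools import combinations

    def rare_pairs(docs, wanted=100):
        pair_count = defaultdict(int)                    # (w1, w2) -> number of documents
        for words in docs.values():                      # O(d) documents
            for pair in combinations(sorted(words), 2):  # O(w^2) pairs per document
                pair_count[pair] += 1
        # pairs counted exactly once occur together in exactly one document
        return [p for p, c in pair_count.items() if c == 1][:wanted]
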
@Microbotz: It's easy to find out. Just compute sum_{i=1}^d w_i^2 and compare it to n^2. I don't think there is a deterministic algorithm that won't have either n^2 or w^2 as a factor. Not sure about randomized, but it seems unlikely (hrhr)Impressure
Oh and by the way, I don't think it's reasonable to assume that real Google data is likely to be the "worst case" of your algorithm. For example, I think you might well be able to find such a pair within the rarest 10000 words or so, so you could design your algorithm so that it sorts the words in ascending order of number of occurrences and tries the pairs in order of decreasing rarity.Impressure
@NiklasB. I agree with you. But I was thinking about this problem and was wondering whether there exists a better way than brute force for it?Ishtar

  1. Order the words by the number of documents they occur in. The idea is that words that are rare overall will also be rare in pairs. If you find a word that occurs in exactly one document, just pick any other word from that document and you are done.
  2. Then start inverting the index, beginning with the rarest word. That means you create a map where each document points to the set of words in it. At first you build that inverted index with the rarest word only. After you have inserted all documents associated with that rarest word, you have a map where each document points to exactly one word.
  3. Then you add the next word with all its documents, still following the order created in (1.). At some point you will find that a document associated with the new word is already present in your inverted map. Here you check, for every word already associated with that document, whether it forms such a rare word pair with the new word (a sketch of these steps follows this list).
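
A minimal sketch of these three steps, assuming index maps each word to its sorted list of document IDs (the names are illustrative, not a fixed API):

    def find_rare_pairs(index, wanted=100):
        # 1. order the words by how many documents they occur in, rarest first
        order = sorted(index, key=lambda w: len(index[w]))
        doc_to_words = {}      # partial inverted map: doc id -> words inserted so far
        found = []
        for word in order:     # 2./3. insert words into the inverted map, rarest first
            for doc in index[word]:
                for other in doc_to_words.get(doc, ()):
                    # candidate pair; verify it co-occurs in exactly this one document
                    if len(set(index[word]) & set(index[other])) == 1:
                        found.append((other, word))
                        if len(found) == wanted:
                            return found
                doc_to_words.setdefault(doc, set()).add(word)
        return found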

The performance of this depends heavily on how far you have to go to find 100 such pairs; the idea is that you are done after processing only a small fraction of the total data set. To take advantage of the fact that you only process a small fraction of the data, step (1.) should use a sort algorithm that yields the smallest elements long before the entire set has been sorted, such as quicksort (recursing into the smaller-elements partition first). Then the sorting can be done in roughly O(N*log(N1)), with N1 being the number of words you actually need to add to the inverted index before finding 100 pairs. The other operations, namely adding a word to the inverted index and checking whether a word pair occurs in more than one document, are also linear in the number of documents per word, so they are fast at the beginning and slow down later, when there are more documents per word.
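
The answer suggests a quicksort-style partial sort; a heap achieves the same effect of yielding the rarest words before the whole vocabulary is ordered, which is one way to sketch it (again with illustrative names):

    import heapq

    # building the heap is O(N); each word actually consumed costs O(log N),
    # so only the N1 words needed before 100 pairs are found pay that cost
    def words_rarest_first(index):
        heap = [(len(docs), word) for word, docs in index.items()]
        heapq.heapify(heap)
        while heap:
            _, word = heapq.heappop(heap)
            yield word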

Butcherbird answered 5/2, 2014 at 17:54 Comment(0)

This is the opposite of "Frequent Itemset Mining".

See, for example, this recent publication: Rare pattern mining: challenges and future perspectives

Ation answered 13/12, 2020 at 8:22 Comment(0)
