Consider there are 10 billion words that people have searched for in google. Corresponding to each word you have the sorted list of all document id's. The list looks like this:
[Word 1]->[doc_i1,doc_j1,.....]
[Word 2]->[doc_i2,doc_j2,.....]
...
...
...
[Word N]->[doc_in,doc_jn,.....]
I am looking for an algorithm to find 100 rare word-pairs. A rare word-pair is a pair of words that occur together(not necessarily contiguous) in exactly 1 document.
I am looking for something better than O(n^2) if possible.
O(d * w^2 * m)
wherew
is the maximum number of words per document andd
is the number of documents. Is that better? – ImpressureO(m)
with high probability using hashing. And sorting all the lists isO(n log m)
or even linear time using integer sorts, so it's easy to do as a preprocessing step if necessary – ImpressureO(m)
whp. We were not talking about solving your problem. – ImpressureO(d)
iterations. In every document you haveO(w)
words, so you enumerate all the pairs inO(w^2)
. Now for every pair, you compute the intersection of their document lists inO(m)
to check whether they occur exactly once. Whole algorithm isO(d * w^2 * m)
but in practice you might find many pairs of words multiple times. You can skip the intersection part if that occurs and save the factorm
in many cases. Whether this is better than the original algorim or not very much depends on your data. I don't see from where you get theO(d^2)
. – Impressuresum_{i=1}^d w_i^2
and compare it ton^2
. I don't think there is a deterministic algorithm that won't have eithern^2
orw^2
as a factor. Not sure about randomized, but it seems unlikely (hrhr) – Impressure10000
words or so, so you could design your algorithm so that it sorts the words ascending by number of occurences and tries the pairs with decreasing rarity – Impressure