Combining TF-IDF (cosine similarity) with PageRank?

Given a query, I have a cosine score for a document. I also have the document's PageRank. Is there a standard, good way of combining the two?

I was thinking of multiplying them:

 Total_Score = cosine-score * pagerank

Because if either the cosine score or the PageRank is too low, the document is not interesting.

Or is it preferable to have a weighted sum?

Total_Score = weight1 * cosine-score + weight2 * pagerank

Is this better? Then a document might have a zero cosine score but a high PageRank, and the page would still show up among the results.
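Here is a minimal Python sketch of the two proposed rules (the function names and the weights w1 = 0.7, w2 = 0.3 are illustrative, not from the question); both assume the cosine score and PageRank are already normalized to [0, 1]:

    def combine_product(cosine_score, pagerank):
        # Product: a low value on either signal drags the total down.
        return cosine_score * pagerank

    def combine_weighted_sum(cosine_score, pagerank, w1=0.7, w2=0.3):
        # Weighted sum: a document can still surface on PageRank alone,
        # even with a zero cosine score.
        return w1 * cosine_score + w2 * pagerank

    print(combine_product(0.0, 0.9))       # 0.0  -> dropped by the product rule
    print(combine_weighted_sum(0.0, 0.9))  # 0.27 -> still shows up in results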

Ivelisseivens answered 18/2, 2013 at 16:12 Comment(1)
The weighted sum is on the right track, but would you want to make that term w*log(PageRank), or w*log(1 + PageRank)? All of this is still a linear combination; wouldn't you want to consider a nonlinear combination instead, one with a sigmoid shape? – Vittoria
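A small sketch of what that comment suggests, assuming the PageRank term is damped with log(1 + pr) and the linear combination is optionally squashed through a sigmoid (the weights and bias below are illustrative, not from the thread):

    import math

    def combine_log(cosine_score, pagerank, w=0.3):
        # Linear combination with a log-damped PageRank term.
        return cosine_score + w * math.log(1 + pagerank)

    def combine_sigmoid(cosine_score, pagerank, w1=1.0, w2=1.0, bias=-1.0):
        # Nonlinear combination: squash a linear score through a sigmoid.
        z = w1 * cosine_score + w2 * math.log(1 + pagerank) + bias
        return 1 / (1 + math.exp(-z))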

The weighted sum is probably better as a ranking rule.

It helps to break the problem up into a retrieval/filtering step and a ranking step. The problem outlined with the weighted-sum approach then no longer holds.

The process outlined in this paper by Sergey Brin and Lawrence Page uses a variant of the vector/cosine model for retrieval and what appears to be some kind of weighted sum for the ranking, where the weights are determined by user activity (see section 4.5.1). Using this approach, a document with zero cosine would not get past the retrieval/filtering step and thus would not be considered for ranking.
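A rough sketch of that two-step process (the threshold and weights below are made-up placeholders, not values from the paper):

    def retrieve(docs, min_cosine=0.1):
        # Filtering step: drop documents with (near-)zero cosine similarity,
        # so PageRank alone can never pull in an irrelevant page.
        return [d for d in docs if d["cosine"] >= min_cosine]

    def rank(docs, w_cos=0.7, w_pr=0.3):
        # Ranking step: weighted sum over the surviving candidates.
        return sorted(docs,
                      key=lambda d: w_cos * d["cosine"] + w_pr * d["pagerank"],
                      reverse=True)

    docs = [{"id": 1, "cosine": 0.0, "pagerank": 0.9},   # filtered out
            {"id": 2, "cosine": 0.6, "pagerank": 0.2},
            {"id": 3, "cosine": 0.4, "pagerank": 0.8}]
    print([d["id"] for d in rank(retrieve(docs))])       # [3, 2]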

Putupon answered 7/4, 2015 at 20:42 Comment(1)
Actually, they mention that the IR score is a weighted sum of type-weights and count-weights computed from hit lists. They say nothing about how they "combine" the IR score with PageRank. – Moline

You could consider using a harmonic mean. With a harmonic mean the two scores are essentially averaged; however, low scores drag the average down more than they would in a regular (arithmetic) mean.

You could use:

Total_Score = 2*(cosine-score * pagerank) / (cosine-score + pagerank)

Let's say the PageRank scored 0.1 and the cosine 0.9. The normal average of these two numbers would be (0.1 + 0.9)/2 = 0.5; the harmonic mean would be 2*(0.9*0.1)/(0.9 + 0.1) = 0.18.
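As a small sketch, assuming both scores are non-negative and on comparable scales:

    def harmonic_mean(cosine_score, pagerank):
        # Harmonic mean of the two scores; return 0 if both are 0
        # to avoid dividing by zero.
        if cosine_score + pagerank == 0:
            return 0.0
        return 2 * (cosine_score * pagerank) / (cosine_score + pagerank)

    print((0.9 + 0.1) / 2)          # 0.5  (arithmetic mean)
    print(harmonic_mean(0.9, 0.1))  # 0.18 (harmonic mean, pulled down by the 0.1)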

Lidalidah answered 5/5, 2015 at 12:18 Comment(2)
Good point about the need to drag the average down when the scores diverge. – Moline
However, this function penalizes too heavily. Example: 1) cos: 0.5, pr: 0.5 gives a total of 0.5, but 2) cos: 0.95, pr: 0.25 gives a total of about 0.396, even though its plain average is higher. Doesn't feel right. – Moline

I understand that you are making a trade-off between relevance and importance. This is a problem of multi-objective optimization.

I think your second solution would work. It is the so-called linear scalarization. You presumably want to know how to optimize the weights, but the methods for doing so follow different philosophies and are somewhat subjective, depending case by case on the priority given to each variable. How to optimize the weights in such a problem is, in fact, an active research area in mathematics, so it is hard to say which model or method fits your case best. You might want to keep reading the Wikipedia links above, see whether you can find some principles for this class of problems, and then follow them to solve your own case.
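For illustration, a minimal sketch of linear scalarization with a grid search over a single weight; the evaluation callback evaluate_ranking is hypothetical and stands in for whatever relevance measure you can compute on held-out data:

    def scalarize(cosine_score, pagerank, w):
        # Linear scalarization: collapse the two objectives into one score.
        return w * cosine_score + (1 - w) * pagerank

    def pick_weight(evaluate_ranking, steps=20):
        # Grid search over w in [0, 1]; evaluate_ranking(w) must return a
        # quality measure (higher is better) for rankings produced with w.
        candidates = [i / steps for i in range(steps + 1)]
        return max(candidates, key=evaluate_ranking)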

Amerson answered 19/12, 2013 at 7:48 Comment(0)

I can't imagine a single case where this would be useful. PageRank computes how "important" a document is, measured by its connections to other important documents (I assume that's what you mean; the edges are document-to-document links based on term co-occurrences. If you mean something else, please specify).

The cosine score is a similarity metric between two documents. So your idea is to combine a pairwise metric with a node metric to find only important documents that are similar to another document? Why not just run PageRank on the ego network of the other document?

Pallua answered 10/6, 2013 at 20:36 Comment(1)
The cosine score is the cosine similarity between the query and the document. – Ivelisseivens
