Algorithms for matching based on keyword intersection
Suppose we have buyers and sellers that are trying to find each other in a market. Buyers can tag their needs with keywords; sellers can do the same for what they are selling. I'm interested in finding algorithm(s) that rank-order sellers in terms of their relevance for a particular buyer on the basis of their two keyword sets.

Here is an example:

buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"} 

and then we have two potential sellers that we need to rank order in terms of their relevance:

seller_keywords[1] = {"furry", "four legs", "arctic circle", "white"}
seller_keywords[2] = {"likes catnip", "furry", 
                      "hates mice", "yarn-lover", "whiskers"}

If we just use the intersection of keywords, we do not get much discrimination: both sellers intersect on 2 keywords. If we instead divide the intersection count by the size of the set union (Jaccard similarity), seller 2 actually does worse because of its greater number of keywords. Any method that does not correct for keyword set size would seem to carry this automatic penalty (and we definitely don't want to penalize adding keywords).
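For concreteness, here is a minimal sketch of both measures in Python (the variable names are mine, not part of the problem statement):

buyer = {"furry", "four legs", "likes catnip", "has claws"}
sellers = {
    1: {"furry", "four legs", "arctic circle", "white"},
    2: {"likes catnip", "furry", "hates mice", "yarn-lover", "whiskers"},
}

for sid, keywords in sellers.items():
    overlap = len(buyer & keywords)            # both sellers score 2
    jaccard = overlap / len(buyer | keywords)  # seller 1: 2/6, seller 2: 2/7
    print(sid, overlap, round(jaccard, 3))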

To put a bit more structure on the problem, suppose we have some truthful measure of the intensity of each keyword attribute (the intensities have to sum to 1 for each seller), e.g.:

seller_keywords[1] = {"furry":.05, 
                      "four legs":.05, 
                      "arctic circle":.8, 
                      "white":.1}

seller_keywords[2] = {"likes catnip":.5, 
                      "furry":.4, 
                      "hates mice":.02, 
                      "yarn-lover":.02, 
                      "whiskers":.06}

Now we could sum up the value of the hits: Seller 1 gets a score of only .1, while Seller 2 gets .9. So far, so good, but we might now get a third seller with a very limited, non-descriptive keyword set:

seller_keywords[3] = {"furry":1}

This catapults them to the top for any hit on their sole keyword, which isn't good.
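Here is a minimal sketch of this weighted-hit scoring; it reproduces the .1 / .9 ranking above and shows seller 3's single keyword dominating:

buyer = {"furry", "four legs", "likes catnip", "has claws"}
seller_keywords = {
    1: {"furry": .05, "four legs": .05, "arctic circle": .8, "white": .1},
    2: {"likes catnip": .5, "furry": .4, "hates mice": .02,
        "yarn-lover": .02, "whiskers": .06},
    3: {"furry": 1},
}

def hit_score(buyer, intensities):
    # Sum the seller's intensities over the keywords the buyer asked for.
    return sum(w for kw, w in intensities.items() if kw in buyer)

for sid, intensities in seller_keywords.items():
    print(sid, hit_score(buyer, intensities))  # 1 -> 0.1, 2 -> 0.9, 3 -> 1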

Anyway, my guess (and hope) is that this is a fairly general problem and that there exist different algorithmic solutions with known strengths and limitations. This is probably something covered in CS101, so I think a good answer to this question might just be a link to the relevant references.

Peshitta answered 28/2, 2011 at 13:21 Comment(1)
I think we should multiply the score by the number of matched keywords. E.g., seller 3 has only 1 match with a score of 1, so its effective score is 1 * 1 = 1, but seller 2 has two matches totaling .9, so its effective score is 2 * .9 = 1.8 and it is selected. What do you say about this approach? – Gaol
I think you're looking to use cosine similarity; it's a basic technique that gets you pretty far as a first hack. Intuitively, you create a vector where every tag you know about has a particular index:

terms[0] --> aardvark
terms[1] --> anteater
...
terms[N] --> zuckerberg

Then you create vectors in this space for each person:

person1[0] = 0     # this person doesn't care about aardvarks
person1[1] = 0.05  # this person cares a bit about anteaters
...
person1[N] = 0

Each person is now a vector in this N-dimensional space. You can then use cosine similarity to calculate the similarity between pairs of them. Computationally, this is basically the same as asking for the angle between the two vectors. You want a cosine close to 1, which means that the vectors are roughly collinear -- that they have similar values along most of the dimensions.
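A minimal sketch of that computation on sparse dict-based vectors (pure Python; representing people as dicts is my assumption):

import math

def cosine_similarity(a, b):
    # Dot product over shared terms, divided by the two vector magnitudes.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

buyer = {"furry": 1, "four legs": 1, "likes catnip": 1, "has claws": 1}
seller2 = {"likes catnip": .5, "furry": .4, "hates mice": .02,
           "yarn-lover": .02, "whiskers": .06}
print(cosine_similarity(buyer, seller2))  # ~0.70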

To improve this metric, you may want to use tf-idf weighting on the elements of your vector. Tf-idf downplays the importance of popular terms (e.g., 'iPhone') and promotes the importance of unpopular terms with which this person seems particularly associated.

Combining tf-idf weighting and cosine similarity does well for most applications like this.
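A rough sketch of the idf half of that weighting, using the collection of seller tag sets as the corpus (the helper names are mine; the log formula is just the standard idf definition):

import math

def idf_weights(all_tag_sets):
    # idf(t) = log(N / df(t)): tags used by fewer sellers weigh more.
    n = len(all_tag_sets)
    df = {}
    for tags in all_tag_sets:
        for t in tags:
            df[t] = df.get(t, 0) + 1
    return {t: math.log(n / count) for t, count in df.items()}

def reweight(vector, idf):
    # Scale each term weight by its idf before the cosine comparison.
    return {t: v * idf.get(t, 0.0) for t, v in vector.items()}

You would reweight both the buyer and seller vectors and then compare them with the cosine function above.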

Aylesbury answered 28/2, 2011 at 14:00 Comment(1)
cosine similarity doesn't solve the last problem with {"furry":1}, but perhaps instead of doing this (i.e. taking the dot product of two normalized vectors), you could use the actual dot product. Failing to normalize the buyer doesn't matter, because it applies the same scale factor to all results and they still rank the same. Failing to normalize the seller means that you can weight sellers according to some other criteria, not just how focused their keyword list is. For a simple example, you could cap the strength of any one keyword, so sellers who only list one keyword have magnitude < 1. – Dovecote
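For example, here is a sketch of that capped, unnormalized dot product (assuming the buyer's keywords are a dict of 0/1 weights; the 0.8 cap is an arbitrary illustration, not a recommended value):

def capped_dot(buyer, seller, cap=0.8):
    # Clip each seller weight so that {"furry": 1} contributes at most
    # `cap` per match, while well-spread keyword lists are unaffected.
    return sum(buyer.get(t, 0) * min(w, cap) for t, w in seller.items())

With this cap, seller 3 scores 0.8 against the example buyer while seller 2 still scores 0.9, so the single-keyword seller no longer wins.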
What you are looking for is called taxonomy: tagging content and ordering it by relevance.

You may not find a ready-to-go algorithm, but you can start from a practical case: the Drupal documentation for taxonomy provides some guidelines, and you can check the sources of its search module.

Basically, the ranking is based on term frequency. If a product is defined with a small number of tags, each of those tags carries more weight, and a tag that appears on only a few products' pages is very specific. You shouldn't define your keywords' intensity in a static way; instead, examine them in their context.
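As a sketch of that idea (my own function and corpus layout, not Drupal's), a tag's weight can be derived from its context instead of a static intensity:

import math

def contextual_weight(tag, product_tags, all_products):
    # A tag weighs more when its product carries few tags (focus) and
    # when it appears on few products overall (specificity).
    focus = 1 / len(product_tags)
    df = sum(1 for tags in all_products if tag in tags)
    specificity = math.log(len(all_products) / df) if df else 0.0
    return focus * specificity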

Regards

Hanghangar answered 28/2, 2011 at 13:44 Comment(1)
This seems more like a specific library for solving the problem than an algorithm or mathematical framework for addressing it. – Taxiplane
