Algorithm to calculate similarity between texts
C

3

7

I am trying to score the similarity between posts from social networks, but I haven't found any good algorithms for that. Any thoughts?

I have tried Levenshtein, Jaro-Winkler, and others, but those are better suited to comparing texts without regard to sentiment. In posts, one text may say "I really love dogs" and another "I really hate dogs", and we need to classify this case as totally different.
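To make the problem concrete, here is a minimal sketch of a plain Levenshtein comparison of the two example sentences. The class name `EditDistanceDemo` and the length-normalized `similarity` helper are assumptions made for illustration, not part of any particular library.

```java
public class EditDistanceDemo {
    // Classic dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: maps the distance to [0, 1], where 1 means identical.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String a = "I really love dogs";
        String b = "I really hate dogs";
        System.out.println("distance   = " + levenshtein(a, b));
        System.out.println("similarity = " + similarity(a, b));
    }
}
```

Only three characters differ between the two sentences, so the normalized score comes out at about 0.83 even though the meanings are opposite.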

Thanks

Catboat answered 27/8, 2010 at 6:49 Comment(0)
C
1

You might want to have a look at Opinion mining and sentiment analysis to give you an idea of the complexity of the task.

Short answer: there are no "good algorithms" for this, only mediocre ones. This is a very hard problem. Good luck.

Centesimal answered 27/8, 2010 at 7:9 Comment(0)
R
4

Ahh... but "I really love dogs" and "I really hate dogs" are totally similar ;), both discuss one's feelings towards dogs. It seems that you're missing a step in there:

  1. Run your algorithm and get the general topic groups (e.g. "feelings towards dogs").
  2. Run your algorithm again, but this time on each previously "discovered" group, and let it further classify the posts into subgroups (e.g. "I hate dogs" / "I love dogs").

If your algorithm adjusts itself based on its experience (i.e. there is some learning involved), then make sure you run one instance of the algorithm for the first classification and a new, separate instance for each sub-classification. If you don't, you may end up in a situation where you find some groups, and any time you run the algorithm on those same groups the results are nearly identical or nothing changes at all.
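The two passes can be sketched roughly as follows. This is only an illustration: a keyword key stands in for a real topic-clustering algorithm, and the word lists (`POSITIVE`, `NEGATIVE`, `STOP`) are tiny hypothetical lexicons, not real resources. It uses `Set.of`, so it assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TwoStageGrouping {
    // Hypothetical, deliberately tiny word lists for illustration only.
    static final Set<String> POSITIVE = Set.of("love", "like", "adore");
    static final Set<String> NEGATIVE = Set.of("hate", "dislike", "detest");
    static final Set<String> STOP = Set.of("i", "really", "the", "a", "my");

    static List<String> tokens(String post) {
        List<String> out = new ArrayList<>();
        for (String t : post.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Pass 1 (stand-in for topic clustering): the sorted content words,
    // ignoring stop words and sentiment words.
    static String topicKey(String post) {
        Set<String> content = new TreeSet<>();
        for (String t : tokens(post)) {
            if (!STOP.contains(t) && !POSITIVE.contains(t) && !NEGATIVE.contains(t)) {
                content.add(t);
            }
        }
        return String.join(" ", content);
    }

    // Pass 2: a crude polarity label from lexicon counts.
    static String polarity(String post) {
        int score = 0;
        for (String t : tokens(post)) {
            if (POSITIVE.contains(t)) score++;
            if (NEGATIVE.contains(t)) score--;
        }
        return score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
    }

    // topic -> polarity -> posts
    static Map<String, Map<String, List<String>>> group(List<String> posts) {
        Map<String, Map<String, List<String>>> groups = new TreeMap<>();
        for (String post : posts) {
            groups.computeIfAbsent(topicKey(post), k -> new TreeMap<>())
                  .computeIfAbsent(polarity(post), k -> new ArrayList<>())
                  .add(post);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of(
            "I really love dogs", "I really hate dogs", "I love pizza")));
    }
}
```

Both dog posts land in the same topic group and are only separated in the second pass, which is the point of splitting the work into two steps.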

Update

Apache Mahout provides a lot of useful algorithms and examples for Clustering, Classification, Genetic Programming, Decision Forest, and Recommendation Mining. Here are some of the text classification examples from Mahout:

I'm not sure which one would best apply to your problem, but maybe if you look them over you'll figure out which one is the most suitable for your specific application.

Roseleeroselia answered 27/8, 2010 at 15:57 Comment(8)
They can be even more similar. For example, what if I say I love dogs, but mean it sarcastically? It is impossible to understand the exact meaning and sentiment of text posted online, because so much important information is conveyed through tone and diction. - Latterll
You are right. I've seen some people using a Self-Organizing Map for spam detection; maybe it would be a good candidate. Thoughts? - Catboat
@Daniel, given a single post, I don't think that even a human would be able to determine whether it's sarcastic or genuine. Sarcasm only works in context, so yes, if it's sarcastic then it will be a false classification. - Roseleeroselia
@user430830, there are a lot of algorithms that might be useful: self-organizing maps, Bayes classifiers, collaborative filtering, etc. Apache Mahout offers a rich and scalable library with many useful algorithms; if you're familiar with Java, give it a try. - Roseleeroselia
Yeah, I am a Java developer. Do you know of any article about Mahout and text similarity? My task is to find similar people by comparing their posts or articles. - Catboat
@user430830, I updated my answer with some references... I hope that helps. - Roseleeroselia
Interesting. If you are looking to group 'similar people', you might expect a PCA on the actual words to give good results. - Brythonic
Here's some research done on sarcasm detection: staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf - Nabokov
C
2

My research is about sentiment analysis, and I agree with Pierre: it's a hard problem, and given its subjective nature, no general algorithm exists. One of the approaches I first tried was mapping sentences into an emotional space and deciding each sentence's sentiment from its distance to the sentiment centroids. You can have a look at it at:
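The general idea of a nearest-centroid decision can be sketched like this. It is not the code behind the tool linked below: the two-dimensional (valence, arousal) space, the word coordinates in `WORD_COORDS`, and the centroid positions are all made-up values chosen for illustration.

```java
import java.util.Map;

public class EmotionCentroidDemo {
    // Hypothetical word -> (valence, arousal) coordinates; real systems use large lexicons.
    static final Map<String, double[]> WORD_COORDS = Map.of(
        "love", new double[]{0.9, 0.7},
        "hate", new double[]{-0.9, 0.7},
        "happy", new double[]{0.8, 0.5},
        "sad", new double[]{-0.7, 0.2});

    // Hypothetical sentiment centroids in the same space.
    static final String[] LABELS = {"positive", "negative", "neutral"};
    static final double[][] CENTROIDS = {{0.8, 0.5}, {-0.8, 0.5}, {0.0, 0.0}};

    // Maps a sentence to the average coordinates of its known words;
    // a sentence with no known words maps to the origin.
    static double[] embed(String sentence) {
        double[] sum = new double[2];
        int known = 0;
        for (String token : sentence.toLowerCase().split("\\W+")) {
            double[] c = WORD_COORDS.get(token);
            if (c != null) {
                sum[0] += c[0];
                sum[1] += c[1];
                known++;
            }
        }
        if (known > 0) {
            sum[0] /= known;
            sum[1] /= known;
        }
        return sum;
    }

    // Returns the label of the centroid closest (Euclidean distance) to the sentence.
    static String classify(String sentence) {
        double[] p = embed(sentence);
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < CENTROIDS.length; i++) {
            double d = Math.hypot(p[0] - CENTROIDS[i][0], p[1] - CENTROIDS[i][1]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return LABELS[best];
    }

    public static void main(String[] args) {
        System.out.println(classify("I really love dogs")); // positive
        System.out.println(classify("I really hate dogs")); // negative
    }
}
```

With these toy coordinates the two example sentences fall on opposite sides of the space, but the quality of any such approach depends entirely on the lexicon and on how the centroids are obtained.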

http://dtminredis.housing.salle.url.edu:8080/EmoLib/

The sentences above work well ;)

Crosscut answered 30/8, 2010 at 13:26 Comment(0)