How to find similar sentences / phrases in R?

Asked 26/1, 2012 at 5:35 Answered 26/1, 2012 at 6:19

For example, I have billions of short phrases, and I want to find clusters of them that are similar.

strings.to.cluster <- c("Best Toyota dealer in bay area. Drive out with a new car today",
                        "Largest Selection of Furniture. Stock updated everyday",
                        " Unique selection of Handcrafted Jewelry",
                        "Free Shipping for orders above $60. Offer Expires soon",
                        "XXXX is where smart men buy anniversary gifts",
                        "2012 Camrys on Sale. 0% APR for select customers",
                        "Closing Sale on office desks. All Items must go")

Assume that this vector has hundreds of thousands of rows. Is there a package in R to cluster these phrases by meaning? Or could someone suggest a way to rank phrases by how similar in meaning they are to a given phrase?

Intoxicated answered 26/1, 2012 at 5:35 Comment(1)

How do you propose to define "meaning"? Which ones of your example phrases should be clustered together? – Potvaliant 26/1, 2012 at 15:32

You can view your phrases as "bags of words", i.e., build a matrix with one row per phrase and one column per word, containing 1 if the word occurs in the phrase and 0 otherwise (a "document-term" matrix; its transpose is the "term-document" matrix built below). (You can replace 1 with a weight that accounts for phrase length and word frequency.) You can then apply any clustering algorithm. The tm package can help you build this matrix.

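library(tm)
library(Matrix)
# Term-document matrix: one row per word, one column per phrase
x <- TermDocumentMatrix(Corpus(VectorSource(strings.to.cluster)))
# Convert tm's triplet representation to a sparse Matrix
y <- sparseMatrix(i = x$i, j = x$j, x = x$v, dims = dim(x), dimnames = dimnames(x))
# Transpose so that rows are phrases, then cluster hierarchically
plot(hclust(dist(t(y))))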
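For the "rank similar phrases to a given phrase" part of the question, the same bag-of-words idea can be sketched in base R without tm. This is only an illustrative sketch: the `tokenize` helper, the IDF-style weight, and the choice of cosine similarity are assumptions made for the example, not something the answer above prescribes.

```r
# Hypothetical helper: split a phrase into lowercase word tokens.
tokenize <- function(s) {
  words <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

phrases <- c("Largest Selection of Furniture. Stock updated everyday",
             "Closing Sale on office desks. All Items must go",
             "Unique selection of Handcrafted Jewelry",
             "2012 Camrys on Sale. 0% APR for select customers")

tokens <- lapply(phrases, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# Binary document-term matrix: one row per phrase, one column per word.
dtm <- t(sapply(tokens, function(tk) as.numeric(vocab %in% tk)))
colnames(dtm) <- vocab

# Assumed weighting: down-weight words that occur in many phrases.
idf      <- log(nrow(dtm) / colSums(dtm)) + 1
weighted <- sweep(dtm, 2, idf, `*`)

# Cosine similarity between all pairs of phrases.
unit <- weighted / sqrt(rowSums(weighted^2))
sim  <- unit %*% t(unit)

# Rank the other phrases by similarity to phrase 1.
ranking <- order(sim[1, ], decreasing = TRUE)
ranking <- ranking[ranking != 1]
phrases[ranking]
```

Here the jewelry phrase ranks first because it shares "selection" and "of" with the furniture phrase. The desk phrase shares no words with it, so a purely word-based measure scores the two as unrelated.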

Bosomed answered 26/1, 2012 at 6:19 Comment(6)

Going off Vincent's suggestion, there's a dissimilarity function in the tm package that accepts numerous distance measures, including "pearson". You could pick a similarity/dissimilarity threshold and select only the sentences that meet it. – Mathi 26/1, 2012 at 16:54
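The thresholding idea can be sketched in base R, independently of tm's own function. The matrix values and the cutoff of 0.5 below are made up for illustration, and "Pearson dissimilarity" is taken here to mean 1 minus the correlation between rows.

```r
# Toy binary matrix: rows are phrases, columns are words (values are made up).
m <- rbind(p1 = c(1, 1, 0, 0, 1, 0),
           p2 = c(1, 1, 0, 0, 0, 0),
           p3 = c(0, 0, 1, 1, 0, 1))

# Pearson dissimilarity, assumed here to be 1 - correlation between rows.
pearson.dist <- 1 - cor(t(m))

# Keep the phrases whose dissimilarity to p1 is below an assumed cutoff.
cutoff <- 0.5
close.to.p1 <- setdiff(names(which(pearson.dist["p1", ] < cutoff)), "p1")
close.to.p1
```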

@TylerRinker, thanks for your question. I am mostly thinking of phrases that are related in meaning. In my example, "closing sale on office desks.." and "Largest Selection of Furniture..." should be clustered together (possibly along with others). – Intoxicated 27/1, 2012 at 5:7

If this approach does not work (you would need, for instance, many sentences with both the "desk" and "furniture" words to automatically identify them as being related), you can either add some knowledge about the meaning of the words (there is a wordnet package that knows that a desk is a piece of furniture) or manually tag some of the sentences (put them in different classes, e.g., "cars", "furniture", "travel", "food", etc.) and use them as a training set to automatically tag the rest of the data. – Bosomed 27/1, 2012 at 5:18
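One simple way to try the training-set idea is a nearest-neighbour rule: give each untagged phrase the tag of the hand-tagged phrase it shares the most words with. The sketch below uses base R only. The training phrases, the `tokenize`, `jaccard` and `tag.phrase` helpers, and the choice of Jaccard similarity are all assumptions made for illustration, and a real classifier would need far more tagged data.

```r
# Hypothetical helper: the set of lowercase words in a phrase.
tokenize <- function(s) {
  words <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
  unique(words[nzchar(words)])
}

# Jaccard similarity between two sets of words.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hand-tagged training phrases (made up for illustration).
train <- data.frame(
  phrase = c("New car sale at the Toyota dealer",
             "Office furniture and desks on sale",
             "Handcrafted jewelry gifts"),
  tag = c("cars", "furniture", "jewelry"),
  stringsAsFactors = FALSE
)

# Assign a phrase the tag of its most similar training phrase (1-nearest neighbour).
tag.phrase <- function(s) {
  sims <- sapply(train$phrase, function(p) jaccard(tokenize(s), tokenize(p)))
  train$tag[which.max(sims)]
}

untagged <- c("Best Toyota dealer in bay area. Drive out with a new car today",
              "Closing Sale on office desks. All Items must go",
              "Unique selection of Handcrafted Jewelry")
predicted <- unname(sapply(untagged, tag.phrase))
predicted
```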

Similar discussion on SE link but not necessarily in R – Mathi 28/1, 2012 at 6:42

@Vincent, which clustering algorithm did you end up using for this? I have the same exact problem. – Amaze 25/11, 2012 at 1:5

@climatewarrior: My answer used hierarchical clustering (hclust), but you can try other algorithms: they are listed in the clustering task view. – Bosomed 27/11, 2012 at 11:16
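To get actual cluster labels out of hclust rather than only a dendrogram, the tree can be cut with cutree. A minimal sketch on a made-up matrix follows; the number of clusters (k = 2) is an assumption chosen for the toy data.

```r
# Toy binary matrix: rows are phrases, columns are words (values are made up).
m <- rbind(p1 = c(1, 1, 0, 0),
           p2 = c(1, 1, 1, 0),
           p3 = c(0, 0, 1, 1),
           p4 = c(0, 0, 0, 1))

tree   <- hclust(dist(m))        # hierarchical clustering on Euclidean distances
groups <- cutree(tree, k = 2)    # cut the dendrogram into 2 clusters
groups                           # named vector: cluster label for each phrase
```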

Maybe this document could help: http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment. It uses R to analyze consumer sentiment about airlines using Twitter.

Durmast answered 26/1, 2012 at 5:42 Comment(2)

That is an interesting approach, but it appears more suited to classification (e.g., good/bad, +ve/-ve) than to the clustering / meaning-based similarity metric that I am interested in. – Intoxicated 26/1, 2012 at 5:57

@sgtpepper Perhaps the package tm could be a good place to start looking. – Durmast 26/1, 2012 at 6:11
