How to find similar sentences / phrases in R?

For example, I have billions of short phrases, and I want to find clusters of them that are similar.

strings.to.cluster <- c("Best Toyota dealer in bay area. Drive out with a new car today",
                        "Largest Selection of Furniture. Stock updated everyday",
                        " Unique selection of Handcrafted Jewelry",
                        "Free Shipping for orders above $60. Offer Expires soon",
                        "XXXX is where smart men buy anniversary gifts",
                        "2012 Camrys on Sale. 0% APR for select customers",
                        "Closing Sale on office desks. All Items must go")

Assume that this vector has hundreds of thousands of rows. Is there a package in R to cluster these phrases by meaning? Or could someone suggest a way to rank "similar" phrases by meaning relative to a given phrase?

Intoxicated answered 26/1, 2012 at 5:35 Comment(1)
How do you propose to define "meaning"? Which ones of your example phrases should be clustered together? Potvaliant

You can view your phrases as "bags of words", i.e., build a matrix with one row per phrase and one column per word, containing 1 if the word occurs in the phrase and 0 otherwise (a "document-term" matrix; its transpose is the "term-document" matrix). You can replace the 1 with a weight that accounts for phrase length and word frequency, such as tf-idf. You can then apply any clustering algorithm to the rows. The tm package can help you build this matrix.

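library(tm)      # corpus handling and term-document matrices
library(Matrix)  # sparse matrix classes
# Term counts, with terms in rows and phrases (documents) in columns
x <- TermDocumentMatrix(Corpus(VectorSource(strings.to.cluster)))
# Convert tm's triplet representation to a sparse Matrix; dims is passed
# explicitly so that a trailing empty row or column is not dropped
y <- sparseMatrix(i = x$i, j = x$j, x = x$v, dims = dim(x), dimnames = dimnames(x))
# Transpose so that each row is a phrase, then cluster hierarchically
plot(hclust(dist(as.matrix(t(y)))))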
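
The weighting idea mentioned above can be sketched in base R. This is only an illustration under assumptions that are not in the original answer: the toy phrases are made up, the tokenizer is a naive regular-expression split, the weight is a plain tf-idf (tm also ships a weightTfIdf function), and cosine distance with average linkage is just one reasonable choice.

```r
# Toy phrases (hypothetical, chosen so that some of them share words)
phrases <- c("closing sale on office desks",
             "office desks and chairs on sale",
             "new toyota cars on sale today",
             "toyota dealer with new cars")

# Naive tokenizer: lower-case, split on anything that is not a letter or digit
tokens <- strsplit(tolower(phrases), "[^a-z0-9]+")
tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
vocab  <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per phrase, one column per word
dtm <- t(vapply(tokens,
                function(tk) as.numeric(table(factor(tk, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# tf-idf: down-weight words that occur in many phrases
idf   <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(dtm, 2, idf, `*`)

# Cosine similarity between phrases (rows scaled to unit length).
# Assumes every phrase keeps at least one non-zero weight.
unit    <- tfidf / sqrt(rowSums(tfidf^2))
cos_sim <- tcrossprod(unit)

# Hierarchical clustering on cosine distance, cut into two groups
hc       <- hclust(as.dist(1 - cos_sim), method = "average")
clusters <- cutree(hc, k = 2)
print(split(phrases, clusters))
```

On this toy input the two desk phrases and the two Toyota phrases should end up in separate groups. With hundreds of thousands of phrases a dense distance matrix will not fit in memory, so the sparse representation would have to be kept and a different clustering algorithm chosen.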
Bosomed answered 26/1, 2012 at 6:19 Comment(6)
Going off of Vincent's suggestion, there's a dissimilarity stat in the tm package that takes numerous distance arguments, including "pearson". You could set some level of similarity/dissimilarity and select only the sentences that meet the set criteria. Mathi
@TylerRinker, thanks for your question. I am mostly thinking of phrases related in meaning. In my example, I would expect "closing sale on office desks.." and "Largest Selection of Furniture..." to be clustered together (along with possibly others). Intoxicated
If this approach does not work (you would need, for instance, many sentences with both the "desk" and "furniture" words to automatically identify them as being related), you can either add some knowledge about the meaning of the words (there is a wordnet package that knows that a desk is a piece of furniture) or manually tag some of the sentences (put them in different classes, e.g., "cars", "furniture", "travel", "food", etc.) and use them as a training set to automatically tag the rest of the data. Bosomed
Similar discussion on SE (link), but not necessarily in R. Mathi
@Vincent, which clustering algorithm did you end up using for this? I have the exact same problem. Amaze
@climatewarrior: My answer used hierarchical clustering (hclust), but you can try other algorithms: they are listed in the clustering task view. Bosomed
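
The idea from the comments of ranking phrases against a given phrase and keeping only those above some similarity level could look roughly like the following. It is a minimal sketch, not anyone's actual method from this thread: tokenize, cosine_binary, rank_similar and the min_sim threshold are hypothetical names introduced here, and the score is cosine similarity on 0/1 word vectors, so it measures word overlap rather than meaning.

```r
# Hypothetical helper: unique lower-case words of a phrase
tokenize <- function(s) {
  tk <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
  unique(tk[nzchar(tk)])
}

# Cosine similarity of two 0/1 bag-of-words vectors, computed on word sets:
# shared words divided by the geometric mean of the two set sizes
cosine_binary <- function(a, b) {
  if (length(a) == 0 || length(b) == 0) return(0)
  length(intersect(a, b)) / sqrt(length(a) * length(b))
}

# Rank phrases by similarity to a query, dropping those below min_sim
rank_similar <- function(query, phrases, min_sim = 0) {
  q    <- tokenize(query)
  sims <- vapply(phrases, function(p) cosine_binary(q, tokenize(p)),
                 numeric(1), USE.NAMES = FALSE)
  out  <- data.frame(phrase = phrases, similarity = sims,
                     stringsAsFactors = FALSE)
  out  <- out[out$similarity >= min_sim, , drop = FALSE]
  out[order(-out$similarity), , drop = FALSE]
}

# Four of the phrases from the question
candidates <- c("Largest Selection of Furniture. Stock updated everyday",
                " Unique selection of Handcrafted Jewelry",
                "2012 Camrys on Sale. 0% APR for select customers",
                "Closing Sale on office desks. All Items must go")

result <- rank_similar("office furniture on sale", candidates, min_sim = 0.1)
print(result)
```

Here the office desks phrase ranks first, but the Camry phrase ranks above the furniture one because it shares "on" and "sale" with the query. That illustrates the limitation discussed above: without stop-word removal, weighting, or extra knowledge such as WordNet, shared words stand in for shared meaning.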

Maybe looking at this document could help: http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment (it uses R and looks at consumer sentiment for airlines using Twitter).

Durmast answered 26/1, 2012 at 5:42 Comment(2)
That is an interesting approach, but it appears more suited to classification (e.g., good/bad, +ve/-ve) than to the clustering / meaning-based similarity metric that I am interested in. Intoxicated
@sgtpepper Perhaps the package tm could be a good place to start looking. Durmast
