unsupervised semantic clustering of phrases
I have about a thousand potential survey items as a vector of strings that I want to reduce to a few hundred. Normally when we talk about data reduction, we have actual data. I administer the items to participants and use factor analysis, PCA, or some other dimension reduction method.

In my case, I don't have any data. Just the items (i.e., text strings). I want to reduce the set by eliminating items with similar meanings. Presumably they would be highly correlated if actually administered to participants.

I've been reading about clustering approaches for textual analysis. This SO question demonstrates an approach I've seen used in different examples. The OP notes that the clustering solution does not quite answer his/her question. Here's how it would be applied (unsatisfactorily) in my case:

# get data (2 columns, 152 rows)

link to text.R file with dput() of sample items

# clustering
library(tm)
library(Matrix)
x <- TermDocumentMatrix( Corpus( VectorSource(text$item) ) )
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dimnames = dimnames(x) )  
plot( hclust(dist(t(y))) )

The plot shows that items 145 and 149 are clustered:

145 "Lets you know you are not wanted"

149 "Lets you know he loves you"

These items share the same stem, "lets you know", which probably accounts for the clustering. Semantically, they are opposites.
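
To see why these two items land together, it can help to look at the words they actually share. The sketch below uses only base R and a simplified tokenizer (lowercase, split on non-letters), which is an assumption and not exactly what tm does internally:

```r
# The two items from the question
items <- c("Lets you know you are not wanted",
           "Lets you know he loves you")

# Simplified tokenizer (assumption): lowercase, split on anything that is not a letter
tokens <- strsplit(tolower(items), "[^a-z]+")

# Words that occur in both items
shared <- intersect(tokens[[1]], tokens[[2]])

# Jaccard overlap: shared unique words / all unique words across both items
jaccard <- length(shared) / length(union(tokens[[1]], tokens[[2]]))

print(shared)
print(jaccard)
```

The overlap is made up entirely of the stem words; the words that carry the opposite meanings ("not wanted" versus "loves") contribute nothing to it.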

The OP had a similar challenge with his/her example. A commenter pointed to the wordnet package as a possible solution.

Question (edited based on feedback)

How can I prevent items like 145 and 149 from clustering because they share stems?

A secondary question with less programmatic focus: Does anyone see a better solution here? Many approaches I've come across involve supervised learning, test/training datasets, and classification. I believe what I am looking for is closer to semantic similarity/clustering (e.g., FAC pdf).

Pederast answered 16/6, 2014 at 11:14 Comment(4)
You could remove some stopwords. Matthew Jockers uses an approach where he removes everything except nouns, which may be useful here. - Vonvona
What is prompting folks to vote to close? I provided a minimal sample dataset, included the code I tried, explained why the code did not produce the result I am seeking, and asked for ideas about alternatives. There is a conceptual piece here for sure, but I think someone who has encountered this before could provide a programming solution in R that achieves the objective. The crowd must know best about closing, but I am a little confused. - Pederast
I didn't vote to close, but it may be that the question is more about content and less about coding. Maybe an edit would bring the question closer to being about coding. - Vonvona
Thanks, @TylerRinker. I edited the question to focus specifically on the coding challenge. - Pederast

+1 to @TylerRinker's suggestions to

  • remove stopwords and
  • use Jockers' methods of clustering using only nouns (I have a worked example of that here).

Another option you should try is making your term document matrix from bigrams rather than unigrams. If you're interested in phrases, bigrams are a good start. I have a worked example of that here.
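
As a rough illustration of what a bigram representation looks like, here is a base R sketch. The helper `make_bigrams` is hypothetical (it is not part of tm or RWeka) and uses a simplified tokenizer, so its output may differ in detail from what NGramTokenizer produces:

```r
# Hypothetical helper: split a phrase into word bigrams using base R only
make_bigrams <- function(phrase) {
  # Simplified tokenizer (assumption): lowercase, split on non-letters, keep apostrophes
  words <- strsplit(tolower(phrase), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))  # pair each word with its successor
}

make_bigrams("Lets you know you are not wanted")
make_bigrams("Lets you know he loves you")
```

Bigrams such as "not wanted" keep the negation attached to the word it modifies, which single-word counts lose.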

Here's a worked example of combining stopword removal with bigrams. With this example you can iterate using different parameter values to get the clustering that seems most sensible to you.

Get the data...

dat <- text <- structure(list(id = c("GHQ1", "GHQ2", "GHQ3", "GHQ4", "GHQ5", 
                                 "GHQ6", "GHQ7", "GHQ8", "GHQ9", "GHQ10", "GHQ11", "GHQ12", "GHQ13", 
                                 "GHQ14", "GHQ15", "GHQ16", "GHQ17", "GHQ18", "GHQ19", "GHQ20", 
                                 "GHQ21", "GHQ22", "GHQ23", "GHQ24", "CGMH9", "GHQ25", "GHQ26", 
                                 "GHQ27", "GHQ28", "GHQ29", "GHQ30", "GHQ31", "PARQ01A-P", "PARQ02A-P", 
                                 "PARQ03A-P", "PARQ04A-P", "PARQ05A-P", "PARQ06A-P", "PARQ07A-P", 
                                 "PARQ08A-P", "PARQ09A-P", "PARQ10A-P", "PARQ11A-P", "PARQ12A-P", 
                                 "PARQ13A-P", "PARQ14A-P", "PARQ15A-P", "PARQ16A-P", "PARQ17A-P", 
                                 "PARQ18A-P", "PARQ19A-P", "PARQ20A-P", "PARQ21A-P", "PARQ22A-P", 
                                 "PARQ23A-P", "PARQ24A-P", "PARQ25A-P", "PARQ26A-P", "PARQ27A-P", 
                                 "PARQ28A-P", "PARQ29A-P", "PARQ30A-P", "PARQ31A-P", "PARQ32A-P", 
                                 "PARQ33A-P", "PARQ34A-P", "PARQ35A-P", "PARQ36A-P", "PARQ37A-P", 
                                 "PARQ38A-P", "PARQ39A-P", "PARQ40A-P", "PARQ41A-P", "PARQ42A-P", 
                                 "PARQ43A-P", "PARQ44A-P", "PARQ45A-P", "PARQ46A-P", "PARQ47A-P", 
                                 "PARQ48A-P", "PARQ49A-P", "PARQ50A-P", "PARQ51A-P", "PARQ52A-P", 
                                 "PARQ53A-P", "PARQ54A-P", "PARQ55A-P", "PARQ56A-P", "PARQ57A-P", 
                                 "PARQ58A-P", "PARQ59A-P", "PARQ60A-P", "PARQ01A-C", "PARQ02A-C", 
                                 "PARQ03A-C", "PARQ04A-C", "PARQ05A-C", "PARQ06A-C", "PARQ07A-C", 
                                 "PARQ08A-C", "PARQ09A-C", "PARQ10A-C", "PARQ11A-C", "PARQ12A-C", 
                                 "PARQ13A-C", "PARQ14A-C", "PARQ15A-C", "PARQ16A-C", "PARQ17A-C", 
                                 "PARQ18A-C", "PARQ19A-C", "PARQ20A-C", "PARQ21A-C", "PARQ22A-C", 
                                 "PARQ23A-C", "PARQ24A-C", "PARQ25A-C", "PARQ26A-C", "PARQ27A-C", 
                                 "PARQ28A-C", "PARQ29A-C", "PARQ30A-C", "PARQ31A-C", "PARQ32A-C", 
                                 "PARQ33A-C", "PARQ34A-C", "PARQ35A-C", "PARQ36A-C", "PARQ37A-C", 
                                 "PARQ38A-C", "PARQ39A-C", "PARQ40A-C", "PARQ41A-C", "PARQ42A-C", 
                                 "PARQ43A-C", "PARQ44A-C", "PARQ45A-C", "PARQ46A-C", "PARQ47A-C", 
                                 "PARQ48A-C", "PARQ49A-C", "PARQ50A-C", "PARQ51A-C", "PARQ52A-C", 
                                 "PARQ53A-C", "PARQ54A-C", "PARQ55A-C", "PARQ56A-C", "PARQ57A-C", 
                                 "PARQ58A-C", "PARQ59A-C", "PARQ60A-C"), item = c("Been feeling unhappy or depressed", 
                                                                                  "Been feeling reasonably happy, all things considered", "Feeling edgy and bad-tempered", 
                                                                                  "Feel constantly under strain", "Found everything getting on top of you", 
                                                                                  "Been feeling nervous and strung-up all the time", "found at times you couldn't do anything because your nerves were too bad", 
                                                                                  "found everything getting on top of you", "thought of the possibility that you might make away with yourself", 
                                                                                  "found that the idea of taking your own life kept coming into your mind?", 
                                                                                  "found yourself withing you were dead and away from it all?", 
                                                                                  "felt that life isn't worth living", "felt that life was entirely hopeless?", 
                                                                                  "been able to enjoy your normal day-to-day activities", "been satisfied with the way you've carried out your task", 
                                                                                  "felt that you are playing a useful part in things", "felt on the whole you were doing things well?", 
                                                                                  "been feeling perfectly well and in good health", "been feeling in need of a good tonic", 
                                                                                  "been feeling run down and out of sorts?", "felt that you are ill", 
                                                                                  "been getting any pains in your head", "been getting a feeling of tightness or pressure in your head", 
                                                                                  "been having hot or cold spells", "Do you feel you have physical problems because of stress?", 
                                                                                  "Lost sleep over worry", "Had difficulty in staying asleep once you are off", 
                                                                                  "felt capable of making decisions about things", "been taking longer over the things that you do", 
                                                                                  "been managing to keep yourself busy and occupied", "been thinking of yourself as a worthless person", 
                                                                                  "been getting scared or panicky for no good reason ", "You say nice things about your child", 
                                                                                  "You nag or scold your child when (s)he is bad", "You ignore your child", 
                                                                                  "You wonder if you really love your child", "You talk to your child about daily routines and plans, and listen to what (s)he has to say", 
                                                                                  "You complain about your child to others when (s)he does not listen to you", 
                                                                                  "You take an interest in your child", "You want your child to bring friends home, and you try to make things pleasant for them", 
                                                                                  "You call your child names and make fun of him/her", "You ignore your child as long as (s)he does nothing to bother you", 
                                                                                  "You yell at your child when you are angry", "You sit close with your child so that (s)he feels free to talk about important things", 
                                                                                  "You are harsh with your child", "You enjoy having your child around you", 
                                                                                  "You make your child feel proud when (s)he does well", "Your hit your child even when (s)he may not deserve it, like for small mistakes", 
                                                                                  "You forget things you are supposed to do for your child", "You see your child as an annoyance", 
                                                                                  "You praise your child to others", "You punish your child when you are angry", 
                                                                                  "You make sure your child has the right kind of food to eat", 
                                                                                  "You talk to your child in a warm and loving way", "You get angry easily at your child", 
                                                                                  "You are too busy to answer your child's questions", "You hate/despise your child", 
                                                                                  "You say nice things to your child when (s)he deserves it, such as when (s)he does well in school", 
                                                                                  "You are irritable with your child", "You care about who your child's friends are", 
                                                                                  "You are really interested in what your child does", "You say many unkind things to your child", 
                                                                                  "You pay no attention to your child when (s)he asks for help", 
                                                                                  "You think it is your child's own fault when (s)he is having trouble", 
                                                                                  "You make your child feel wanted and needed", "You tell your child (s)he annoys you", 
                                                                                  "You pay a lot of attention to your child", "You tell your child how proud you are of him/her when (s)he is good", 
                                                                                  "You hurt your child's feelings", "You forget important things your child thinks you should remember", 
                                                                                  "When your child misbehaves, you make him/her feel unloved", 
                                                                                  "You make your child feel what (s)he does is important", "When your child does something wrong, you frighten or threaten him/her", 
                                                                                  "You like to spend time with your child, for example you sit and laugh together", 
                                                                                  "You try to help your child when (s)he is scared or upset", "When your child misbehaves, you shame him/her in front of his/her friends", 
                                                                                  "You avoid your child's company", "You complain about your child", 
                                                                                  "You care about what your child thinks, and encourage him/her to talk about it", 
                                                                                  "You feel other children are better than your own child", "When you make plans, you take your child's thoughts into consideration", 
                                                                                  "You let your child do things (s)he thinks are important, even if it is hard for you", 
                                                                                  "When your child misbehaves, you compare him/her unfavorably with other children", 
                                                                                  "You want to leave your child in someone else's care (for example, a neighbor or relative)", 
                                                                                  "You let your child know (s)he is not wanted", "You are interested in the things your child does", 
                                                                                  "You try to make your child feel better when (s)he is hurt or sick", 
                                                                                  "You tell your child you are ashamed of him/her when (s)he misbehaves", 
                                                                                  "You let your child know you love him/her", "You treat your child gently and with kindness", 
                                                                                  "When your child misbehaves, you make him/her feel ashamed or guilty", 
                                                                                  "You try to make your child happy", "Says nice things about you", 
                                                                                  "Nags or scolds you when you are bad", "Ignores you", "Does not really love you", 
                                                                                  "Talks to you about your plans and listens to what you have to say", 
                                                                                  "Complains about you to others when you do not listen to him", 
                                                                                  "Takes an interest in you", "Wants you to bring your friends home, and tries to make things pleasant for them", 
                                                                                  "Calls you names, ridicules you, and makes fun of you", "Ignores you as long as you do nothing to bother him", 
                                                                                  "Yells at you when he is angry", "Sits close with you so that you feel free to talk about important things", 
                                                                                  "Treats you harshly", "Enjoys having you around him", "Make you feel proud when you do well", 
                                                                                  "Hits you even when you do not deserve it, like for small mistakes", 
                                                                                  "Forgets things he is supposed to do for you", "Sees you as an annoyance", 
                                                                                  "Praises you to others", "Punishes you severely when he is angry", 
                                                                                  "Makes sure you have the right kind of food to eat", "Talks to you in a warm and loving way", 
                                                                                  "Gets angry at you easily", "Is too busy to answer your questions", 
                                                                                  "Seems to hate / despise you", "Says nice things to you when you deserve them, such as when you do well in school", 
                                                                                  "Gets mad quickly and picks on you", "Wants to know who your friends are", 
                                                                                  "Is really interested in what you do", "Says many unkind things to you", 
                                                                                  "Pays no attention when you ask for help", "Thinks it is your own fault when you are having trouble", 
                                                                                  "Makes you feel wanted and needed", "Tells you that you annoy him", 
                                                                                  "Pays a lot of attention to you", "Tells you how proud he is of you when you are good", 
                                                                                  "Goes out of his way to hurt your feelings", "Forgets important things you think he should remember", 
                                                                                  "Makes you feel unloved if you misbehave", "Makes you feel what you do is important", 
                                                                                  "Frightens or threatens you when you do something wrong", "Likes to spend time with you, for example you sit and laugh together", 
                                                                                  "Tries to help you when you are scared or upset", "Shames you in front of your friends when you misbehave", 
                                                                                  "Tries to stay away from you", "Complains about you and talks about you behind your back", 
                                                                                  "Cares about what you think, and likes you to talk about it", 
                                                                                  "Feels other children are better than you are no matter what you do", 
                                                                                  "Cares about what you would like when he makes plans", "Lets you do things you think are important, even if it is hard for him", 
                                                                                  "Thinks other children behave better than you do", "Wants other people to take care of you (for example, a neighbor or relative)", 
                                                                                  "Lets you know you are not wanted", "Is interested in the things you do", 
                                                                                  "Shows concern and tries to make you feel better when you are hurt or sick", 
                                                                                  "Tells you how ashamed he is when you misbehave", "Lets you know he loves you", 
                                                                                  "Treats you gently and with kindness", "Makes you feel ashamed or guilty when you misbehave", 
                                                                                  "Tries to make you happy")), .Names = c("id", "item"), row.names = c(NA, 
                                                                                                                                                       152L), class = "data.frame")

Now make a tdm of bigrams, then remove bigrams containing stopwords...

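library("RWeka")
library("tm")
library("Matrix")
# some setups hit an "invalid 'times' argument" error with RWeka; forcing a single core avoids it
options(mc.cores = 1)
# tokenizer that returns two-word sequences only
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# note: with newer tm versions, VCorpus() may be needed for the custom tokenizer to take effect
x <- TermDocumentMatrix(Corpus(VectorSource(text$item)), control = list(tokenize = BigramTokenizer))
# little bit of regex to remove bigrams with stopwords in them, cf. https://mcmap.net/q/514846/-matching-multiple-patterns
stpwrds <- paste(stopwords("en"), collapse = "|")
# show the bigrams that contain none of the stopword patterns
x$dimnames$Terms[!grepl(stpwrds, x$dimnames$Terms)]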
[1] "cold spells"  "else s"       "feel free"    "feel proud"
[5] "feel unloved" "feels free"   "lost sleep"

A quick test of removing bigrams with the stock list of stopwords that comes with the tm package leaves us with only 7 bigrams! Clearly we need a smaller stopword list, so let's make a custom list by finding the most frequent words in this particular corpus and removing those.
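
One likely contributor to such a short list is that a pattern built by joining stopwords with "|" matches them anywhere inside a term, so a one-letter stopword such as "i" or "a" knocks out every bigram containing that letter. A minimal sketch of the difference, using a small made-up vector (`demo_stopwords` is hypothetical, standing in for stopwords("en")) and word-boundary anchors:

```r
# Hypothetical stand-ins for the real stopword list and the real bigram terms
demo_stopwords <- c("i", "a", "you", "the")
bigrams <- c("hurt feelings", "you know", "lost sleep", "nice things")

# Loose pattern: matches a stopword anywhere, even inside another word
loose <- paste(demo_stopwords, collapse = "|")

# Strict pattern: matches a stopword only as a whole word
strict <- paste0("\\b(", loose, ")\\b")

kept_loose  <- bigrams[!grepl(loose, bigrams)]
kept_strict <- bigrams[!grepl(strict, bigrams, perl = TRUE)]

print(kept_loose)
print(kept_strict)
```

With the loose pattern, "hurt feelings" and "nice things" are dropped only because they contain the letter "i". The strict pattern keeps them and drops only "you know". Whether whole-word matching gives a more useful term list on the full data would need to be checked.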

# find freq words in corpus
x <- TermDocumentMatrix(Corpus(VectorSource(text$item)))
# arbitrary choice: 10 or more occurrences counts as high frequency
mystopwords <- findFreqTerms(x, 10, Inf)

You should experiment with the lowfreq value. I've set it at 10 after trying just a few values, but others might work better.
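
One way to explore the threshold is to list which words would become stopwords at several cut-offs. The sketch below is base R only, with a made-up mini-corpus standing in for text$item and a simplified word count. It is an assumption-laden stand-in for findFreqTerms(), which works on the tm matrix and may count slightly differently (for example, tm drops very short words by default):

```r
# Hypothetical mini-corpus standing in for text$item
items <- c("You ignore your child",
           "You praise your child to others",
           "You complain about your child",
           "Ignores you",
           "Praises you to others")

# Simplified word counts (assumption): lowercase, split on non-letters
words <- unlist(strsplit(tolower(items), "[^a-z]+"))
words <- words[nzchar(words)]
freq  <- table(words)

# For each candidate threshold, list the words that would be treated as stopwords
thresholds <- c(2, 3, 5)
candidates <- lapply(thresholds, function(k) names(freq)[as.vector(freq >= k)])
names(candidates) <- thresholds
print(candidates)
```

Higher thresholds leave fewer words on the custom stopword list, so more bigrams survive the filter.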

# try to filter the bigrams again with custom stopword list
x <- TermDocumentMatrix(Corpus(VectorSource(text$item)), control = list(tokenize = BigramTokenizer))
# little bit of regex to remove bigrams with mystopwords in them, cf. https://mcmap.net/q/514846/-matching-multiple-patterns
mystpwrds <- paste(mystopwords, collapse = "|")
# subset tdm to keep only bigrams remaining after mystopwords removed
x <- x[x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)],]
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dimnames = dimnames(x) )  
plot( hclust(dist(t(y))) )

[Image: dendrogram produced by the hclust plot above]

But that's a bit hard to read, so let's print the group memberships instead:

hc <- hclust(dist(t(y)))
cutree(hc, k = 100)

 1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
  1   2   3   4   5   3   6   5   3   7   8   9  10  11  12  13 
 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32 
 14  15  16  17   3  18  19  20  21  22  23  24  25  26  27  28 
 33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48 
  3  29   3  30  31  32  33  34  35  36   3  37   3   3  38  39 
 49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 
 40  41   3   3  42  43  44  45   3  46   3   3  47  48  49  50 
 65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
  3  51  52  53   3   3   3  54  55  56  57  58   3   3   3  59 
 81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96 
 60  61   3  62  63  47  64  65   3   3  66   3   3  67   3  68 
 97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 
 69  70  33  71  35  72  73  74   3  75   3  76  40  41   3  73 
113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 
 42  43  77  45  78  79  80  81  47  48  82  83   3   3  84  85 
129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 
 86  87   3  88  89  90  91  92  93   3   3  59   3  94  95  96 
145 146 147 148 149 150 151 152 
  3  97  98  99 100   3  66   3 
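
The raw cutree() vector is still awkward to scan. A small sketch of grouping labels by cluster with split(), shown on a toy matrix (the data are made up; with the objects above, the analogous call would presumably be split(text$item, cutree(hc, k = 100))):

```r
# Toy data (made up): four labelled points forming two obvious pairs
m <- matrix(c(1.0, 0.0,
              1.0, 0.1,
              0.0, 1.0,
              0.1, 1.0),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("item_a", "item_b", "item_c", "item_d"), NULL))

hc_toy <- hclust(dist(m))
groups <- cutree(hc_toy, k = 2)

# One list element per cluster, holding the labels assigned to it
members <- split(names(groups), groups)

# Clusters with more than one member are the candidates for pruning
multi <- Filter(function(g) length(g) > 1, members)
print(multi)
```

In the output above, group 3 absorbs a large share of the items. A plausible but unverified explanation is that items left with no bigrams after filtering have identical all-zero columns, so a listing like this is worth inspecting before treating that group as a set of near-duplicates.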

And we see that rows 145 and 149 are now in different groups. Whether or not this is a good answer is hard to tell, since you don't specify what your desired output should be. That's why you got close votes: it's difficult to infer from your question what a good answer would be. More specifically, you ask readers to "recommend or find a tool, library or favorite off-site resource", which leads to opinionated answers, and the SO crowd seems to prefer a concrete example of the desired output. You might try your question at the new Data Science Stack Exchange site.

Anyway, hopefully you've now got a few more ideas and a few more knobs to twiddle in your explorations of the data. Feel free to ask another question if you come up against a specific programming problem related to this.

Mariehamn answered 17/6, 2014 at 6:29 Comment(7)
Thanks for putting some time into creating a very helpful answer. I'm running through it now. I'm getting an error on x <- TermDocumentMatrix(Corpus(VectorSource(text$item)), control = list(tokenize = BigramTokenizer)): Error in rep(seq_along(x), sapply(tflist, length)) : invalid 'times' argument. In addition: Warning message: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'. - Pederast
Probably something to do with parallelisation or Java. That's a common error; no doubt you've already googled it and found these: https://mcmap.net/q/719312/-bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/1036500 & https://mcmap.net/q/942353/-finding-ngrams-in-r-and-comparing-ngrams-across-corpora/1036500 - Mariehamn
Yes, setting options(mc.cores=1) before NGramTokenizer() is called does the trick. - Pederast
I think the following error should not be specific to my setup: x <- x[x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)]] gives Error in x$nrow : $ operator is invalid for atomic vectors. The inner portion, x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)], works. Adding a comma gets it to run: x <- x[x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)],]. - Pederast
Good catch, I've edited my answer. Does that answer your question? - Mariehamn
I mean, is this approach better than the basic hclust for your specific use case? - Mariehamn
Yes, @Ben. It is a definite improvement. I'm still searching for a method that looks at semantic similarity, but this is very helpful. - Pederast
