Finding ngrams in R and comparing ngrams across corpora
I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement").

This is a two-step question, one regarding my code so far and one regarding how I should go on.

Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:

library(tm)
library(RWeka)

a <- Corpus(DirSource("/mycorpora/1965"), readerControl = list(language="lat")) # that dir is full of txt files
summary(a)  
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) 
a <- tm_map(a, stemDocument, language = "english") 
# everything works fine so far, so I start playing around with what I have
adtm <- DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)

inspect(adtm) 

findFreqTerms(adtm, lowfreq=10) # find terms that occur at least 10 times

findAssocs(adtm, "usa",.5) # just looking for some associations  
findAssocs(adtm, "china",.5)

# ... and so on, and so forth, all of this works fine

The corpus I load into R works fine with most functions I throw at it. I haven't had any problems creating TDMs from my corpus, finding frequent words and associations, creating word clouds, and so on. But when I try to identify ngrams using the approach outlined in the tm FAQ, I'm apparently making some mistake with the TermDocumentMatrix constructor:

# Trigram

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                Weka_control(min = 3, max = 3))

tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))

inspect(tdm)

I get this error message:

Error in rep(seq_along(x), sapply(tflist, length)) : 
invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

Any ideas? Is "a" not the right class/object? I'm confused. I assume there's a fundamental mistake here, but I'm not seeing it. :(

Step 2: Then I would like to identify ngrams that are significantly overrepresented when I compare the corpus against other corpora. For example, I could compare my corpus against a large standard English corpus. Or I could create subsets that I can compare against each other (e.g. Soviet vs. Chinese Communist terminology). Do you have any suggestions how I should go about doing this? Any scripts/functions I should look into? Just some ideas or pointers would be great.

Thanks for your patience!

Bendix answered 27/10, 2013 at 6:8 Comment(7)
I had the same error; for me it worked when I set min different from max in Weka_control... don't know if this is an option for you. – Commorancy
Thanks for your advice! Didn't work for me, though. The error message remains the same when I change the min/max values. – Bendix
Just in case people ever find this or are interested: I have not actually solved the first problem, but did manage to work around it by using a similar function provided by the RTextTools package: matrix <- create_matrix(corpus, ngramLength=3) – Bendix
Can you share some of your data (on a free temporary file hosting site, perhaps)? That will help with reproducing your problem and finding solutions. – Joni
Thank you. Yes, I have uploaded a corpus sample here: s000.tinyupload.com/index.php?file_id=46554569218218543610 – Bendix
How would this be done with unstructured binary data? Say, on binary patterns within an EXE or PDF file, without decoding or analyzing the file format's structure? – Delicacy
Just set the number of available cores to 1: options(mc.cores=1) – Strobilaceous
I could not reproduce your problem; are you using the latest versions of R, tm, RWeka, etc.?

require(tm)
a <- Corpus(DirSource("C:\\Downloads\\Only1965\\Only1965"))
summary(a)  
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) 
# a <- tm_map(a, stemDocument, language = "english") 
# I also got it to work with stemming, but it takes so long...
adtm <- DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)

inspect(adtm) 

findFreqTerms(adtm, lowfreq=10) # find terms that occur at least 10 times
findAssocs(adtm, "usa",.5) # just looking for some associations  
findAssocs(adtm, "china",.5)

# Trigrams
require(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.75)
inspect(tdm[1:5,1:5])

And here's what I get

A term-document matrix (5 terms, 5 documents)

Non-/sparse entries: 11/14
Sparsity           : 56%
Maximal term length: 28 
Weighting          : term frequency (tf)

                                   Docs
Terms                               PR1965-01.txt PR1965-02.txt PR1965-03.txt
  †chinese press                              0             0             0
  †renmin ribao                               0             1             1
  — renmin ribao                              2             5             2
  “ chinese people                            0             0             0
  “renmin ribaoâ€\u009d editorial             0             1             0
  etc. 

Regarding your step two, here are some pointers to useful starts:

http://quantifyingmemory.blogspot.com/2013/02/mapping-significant-textual-differences.html
http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/
and here's his code: https://dl.dropboxusercontent.com/u/4713959/Neuchatel/NassrProgram.R
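For the comparison itself, one common starting point is Dunning's log-likelihood ratio (G²), which scores how surprising a term's frequency in one corpus is relative to another. Here's a minimal base-R sketch (my own illustration, not from the links above; the function name g2 is made up):

```r
# Dunning's log-likelihood (G2) for one term: 'a' and 'b' are the term's
# counts in corpus A and corpus B; 'na' and 'nb' are the corpus sizes
# (total token or ngram counts). Larger G2 = more surprising difference.
g2 <- function(a, b, na, nb) {
  e1 <- na * (a + b) / (na + nb)   # expected count in corpus A
  e2 <- nb * (a + b) / (na + nb)   # expected count in corpus B
  ll <- 0
  if (a > 0) ll <- ll + a * log(a / e1)
  if (b > 0) ll <- ll + b * log(b / e2)
  2 * ll
}

# Example: a trigram seen 50 times in a 100,000-ngram propaganda corpus
# vs. 5 times in a 1,000,000-ngram reference corpus:
g2(50, 5, 1e5, 1e6)  # about 207 -> strongly overrepresented in corpus A
```

You could apply this over the row sums of two TermDocumentMatrix objects (one per corpus, built with the same tokenizer) and sort descending by G² to surface the most overrepresented ngrams.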

Joni answered 31/10, 2013 at 6:44 Comment(2)
Thank you again, Ben. I checked my R, RWeka and tm versions and everything seems to be up to date. This error was apparently discussed before (stackoverflow.com/questions/17703553) and you had weighed in that it might have something to do with the Java installation. I tried running the code on a Windows machine and everything went smoothly, so I'm guessing that was the issue. As for Step 2, Ted Underwood's Nassr script appears to do pretty much what I'm looking for, only with words instead of ngrams. I will try to decipher it and learn from it! Thanks! – Bendix
No worries. Yes, Java... all I remember about that is that it's the source of a lot of frustration! Glad to hear you've got a few options for getting past that hurdle. Curious to see how your n-gram overrepresentation analysis goes; do post another question on that when you've got some code working. – Joni
Regarding Step 1, Brian.keng gives a one-liner workaround here https://mcmap.net/q/719312/-bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka that solves this issue on Mac OS X. It seems to be related to parallelisation rather than (the minor nightmare that is) Java setup on the Mac.
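The workaround amounts to disabling tm's parallel processing before building the n-gram matrix. A minimal sketch, assuming the corpus a and the tokenizer from the question (this is a session setting, so it needs to run before the TermDocumentMatrix call):

```r
# Force single-threaded processing: the RWeka/Java tokenizer does not
# survive tm's forked parallel workers on Mac OS X.
options(mc.cores = 1)

# Then build the trigram TermDocumentMatrix exactly as in the question:
TrigramTokenizer <- function(x)
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
```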

Thorma answered 26/3, 2014 at 13:2 Comment(0)
You may want to access the functions explicitly, like this:

BigramTokenizer  <- function(x) {
    RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))
}

myTdmBi.d <- TermDocumentMatrix(
    myCorpus.d,
    control = list(tokenize = BigramTokenizer, weighting = weightTfIdf)
)

Also, some other things that came up along the way.

myCorpus.d <- tm_map(myCorpus.d, tolower)  # This does not work anymore 

Try this instead

 myCorpus.d <- tm_map(myCorpus.d, content_transformer(tolower))  # Make lowercase

In the RTextTools package,

create_matrix(as.vector(C$V2), ngramLength=3) # ngramLength throws an error message.

Unbolted answered 22/8, 2014 at 19:13 Comment(0)
Further to Ben's answer - I couldn't reproduce this either, but in the past I've had trouble with the plyr package and conflicting dependencies. In my case there was a conflict between Hmisc and ddply. You could try adding this line just prior to the offending line of code:

tryCatch(detach("package:Hmisc"), error = function(e) NULL)

Apologies if this is completely tangential to your problem!

Jolynnjon answered 23/11, 2013 at 20:2 Comment(0)
