Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Following the many guides to creating biGrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-Grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus' but not using 'Corpus'. BTW, I'm pretty sure this was working with 'Corpus' ~1 month ago but it is not now.

R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest.

I would appreciate any insight into why this won't work with Corpus, and whether others have the same problem.

#A Reproducible example
#
#Weka bi-gram test
#

library(tm)
library(RWeka)

someCleanText <- c("Congress shall make no law respecting an establishment of",
                    "religion, or prohibiting the free exercise thereof or",
                    "abridging the freedom of speech or of the press or the",
                    "right of the people peaceably to assemble and to petition",
                    "the Government for a redress of grievances")

aCorpus <- Corpus(VectorSource(someCleanText))   #With this, only 1-Grams are created
#aCorpus <- VCorpus(VectorSource(someCleanText)) #With this, biGrams are created as desired

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

aTDM <- TermDocumentMatrix(aCorpus, control=list(tokenize=BigramTokenizer))

print(aTDM$dimnames$Terms)

Result with 'Corpus'

 [1] "congress"      "establishment" "law"           "make"         
 [5] "respecting"    "shall"         "exercise"      "free"         
 [9] "prohibiting"   "religion"      "the"           "thereof"      
[13] "abridging"     "freedom"       "press"         "speech"       
[17] "and"           "assemble"      "peaceably"     "people"       
[21] "petition"      "right"         "for"           "government"   
[25] "grievances"    "redress"

Result with 'VCorpus'

 [1] "a redress"        "abridging the"    "an establishment" "and to"          
 [5] "assemble and"     "congress shall"   "establishment of" "exercise thereof"
 [9] "for a"            "free exercise"    "freedom of"       "government for"  
[13] "law respecting"   "make no"          "no law"           "of grievances"   
[17] "of speech"        "of the"           "or of"            "or prohibiting"  
[21] "or the"           "peaceably to"     "people peaceably" "press or"        
[25] "prohibiting the"  "redress of"       "religion or"      "respecting an"   
[29] "right of"         "shall make"       "speech or"        "the free"        
[33] "the freedom"      "the government"   "the people"       "the press"       
[37] "thereof or"       "to assemble"      "to petition"
Unicuspid answered 13/3, 2017 at 5:33 Comment(7)
Not reproducible. Which version of R / RWeka are you using? - Clumsy
Thanks for trying. Does 'not reproducible' mean you couldn't execute the code, or that you got the expected results, unlike me? The versions I'm using are listed in the original question: R (3.3.3), RWeka (0.4-31). Both, along with 'tm', RTools and RStudio, were updated to the latest available versions within a day of the original post. - Unicuspid
I have the same problem as Paul_J. It has nothing to do with Weka, because I can produce similar behaviour with self-written tokenizers. I am using R version 3.3.2 (2016-10-31) and tm 0.7.1. - Sportive
Hi @SandipanDey, @paul-j, there seems to be a related question on SO here. I came upon this problem today, with R, RWeka and RStudio all at the current version as of this writing. - Pricilla
Can you produce a sessionInfo() for us? - Pricilla
I reproduced the exact same result with this (truncated) sessionInfo(): R version 3.4.0 (2017-04-21) Platform: x86_64-w64-mingw32/x64 (64-bit) other attached packages: [1] LDAvis_0.3.2 stringi_1.1.5 dplyr_0.7.2 topicmodels_0.2-6 bindrcpp_0.2 tm_0.7-1 [7] magrittr_1.5 openNLPmodels.en_1.5-1 RWeka_0.4-34 openNLP_0.2-6 NLP_0.1-10 - Spy
With Corpus I get unigrams. With VCorpus, I get the following error message when I call TermDocumentMatrix(): Error in .jcall(man, "Ljava/lang/Object;", "objectForName", as_qualified_name(name)) : java.lang.IncompatibleClassChangeError: Implementing class - Accra
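
For anyone who wants to test this without RWeka or Java, as the comments about self-written tokenizers and the Java error suggest, here is a minimal sketch of a bigram tokenizer in base R. The name bigram_tokenizer is hypothetical and not part of tm or RWeka, and the splitting rule (any run of non-alphanumeric characters) is an assumption chosen for simplicity.

```r
# Hypothetical helper, base R only: split text into words and pair up neighbours.
bigram_tokenizer <- function(x) {
  # as.character() also accepts a plain character vector
  words <- unlist(strsplit(as.character(x), "[^[:alnum:]]+"))
  words <- words[nzchar(words)]            # drop empty strings left by the split
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))  # word i joined with word i + 1
}

bigrams <- bigram_tokenizer("Congress shall make no law")
print(bigrams)
```

It could then be passed as control=list(tokenize=bigram_tokenizer) in the TermDocumentMatrix() call from the question. Going by the reports in this thread, it should only take effect when the corpus is built with VCorpus.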

I was working with R 3.4.1 and changed to R 3.3.3; now the VCorpus solution works for me. Both tm and RWeka create the bigrams correctly.

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Leavetaking answered 28/8, 2017 at 11:37 Comment(0)

I was able to reproduce exactly the same results you got.

When I started reading about Corpus and VCorpus, most references said the difference was basically that VCorpus is a volatile corpus that is held in memory, but that is not the only difference. In recent versions of tm (such as the 0.7-1 used in the question), Corpus() returns a SimpleCorpus by default for a VectorSource, and SimpleCorpus does not support everything VCorpus does. In particular, TermDocumentMatrix() appears to ignore a custom tokenize function for a SimpleCorpus, which is why you get 2-grams with VCorpus but not with plain Corpus. For more information, see this post on Stack Exchange: https://stats.stackexchange.com/questions/164372/what-is-vectorsource-and-vcorpus-in-tm-text-mining-package-in-r
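
A quick way to check which kind of corpus you actually have is to inspect its class. This is only a sketch: the sample text and the variable names simple_corpus and volatile_corpus are made up for illustration, and the exact class vector printed may depend on your tm version.

```r
library(tm)

txt <- c("Congress shall make no law",
         "respecting an establishment of religion")

simple_corpus   <- Corpus(VectorSource(txt))   # convenience constructor
volatile_corpus <- VCorpus(VectorSource(txt))  # explicitly a VCorpus

# Compare the two class vectors; on the tm version from the question,
# the first is expected to be a SimpleCorpus rather than a VCorpus.
print(class(simple_corpus))
print(class(volatile_corpus))
```

If the first line does not show VCorpus, building the corpus with VCorpus() explicitly, as in the commented-out line of the question, is the workaround reported in this thread.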

Py answered 26/10, 2019 at 22:19 Comment(0)
