Why does the ngrams() function give distinct bigrams?

Asked 29/9, 2015 at 17:25 Answered 6/10, 2015 at 13:32

I am writing an R script and am using library(ngram).

Suppose I have a string,

"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

and want to find bi-grams.

The ngram library is giving me bi-grams as follows:

"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador" "dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"

As the sentence contains "dog food" two times, I want this bi-gram two times. But I am getting it once!

Is there an option in thengram library or any other library that gives all the bi-grams of my sentence in R?

Webfoot answered 29/9, 2015 at 17:25 Comment(0)

You can use stylo package. Gives duplicates:

library(stylo)
a = "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
b = txt.to.words(a)
c = make.ngrams(b, ngram.size = 2)
print(c)

Result:

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"        "can dog"          "dog food"        
[10] "food product"     "product found"    "found good"       "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  
>

Pentstemon answered 29/9, 2015 at 17:44 Comment(0)

The development version of ngram has a get.phrasetable method:

devtools::install_github("wrathematics/ngram")
library(ngram)

text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

ng <- ngram(text)
head(get.phrasetable(ng))
#            ngrams freq       prop
# 1    good qualiti    2 0.07692308
# 2        dog food    2 0.07692308
# 3 appreci product    1 0.03846154
# 4    process meat    1 0.03846154
# 5    food product    1 0.03846154
# 6     food bought    1 0.03846154

In addition, you can use the print() method and specify output == "full". That is:

print(ng, output = "full")

# NOTE: more output not shown...
better labrador | 1 
finicki {1} | 

dog food | 2 
product {1} | bought {1} 
# NOTE: more output not shown...

Matutinal answered 29/9, 2015 at 17:56 Comment(0)

You can use stylo package. Gives duplicates:

library(stylo)
a = "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
b = txt.to.words(a)
c = make.ngrams(b, ngram.size = 2)
print(c)

Result:

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"        "can dog"          "dog food"        
[10] "food product"     "product found"    "found good"       "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  
>

Pentstemon answered 29/9, 2015 at 17:44 Comment(0)

You could use RWeka. In the result you can see "dog food" and "good qualiti" appearing twice

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"


library(RWeka)
RWEKABigramTokenizer <- function(x) {
      NGramTokenizer(x, Weka_control(min = 2, max = 2)) 
}

RWEKABigramTokenizer(txt)

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"       
 [8] "can dog"          "dog food"         "food product"     "product found"    "found good"       "good qualiti"     "qualiti product" 
[15] "product look"     "look like"        "like stew"        "stew process"     "process meat"     "meat smell"       "smell better"    
[22] "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"

Or use the tm package in combination with RWeka

library(tm)
library(RWeka)
my_corp <- Corpus(VectorSource(txt))
tdm_RWEKA <- TermDocumentMatrix(my_corp, control=list(tokenize = RWEKABigramTokenizer))

#show the 2 bigrams
findFreqTerms(tdm_RWEKA, lowfreq = 2)

[1] "dog food"     "good qualiti"

#turn into matrix with frequency counts
tdm_matrix <- as.matrix(tdm_RWEKA)

Hort answered 29/9, 2015 at 19:47 Comment(0)

In order to produce such bi-gram, you don't need any special package. Basically, split the text and paste it together again.

t <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ug <- strsplit(t, " ")[[1]]
bg <- paste(ug, ug[2:length(ug)])

The resulted bg would be:

[1] "good qualiti"     "qualiti dog"      "dog food"
[4] "food bought"      "bought sever"     "sever vital"
[7] "vital can"        "can dog"          "dog food"
[10] "food product"     "product found"    "found good"
[13] "good qualiti"     "qualiti product"  "product look"
[16] "look like"        "like stew"        "stew process"
[19] "process meat"     "meat smell"       "smell better"
[22] "better labrador"  "labrador finicki" "finicki appreci"
[25] "appreci product"  "product better"   "better qualiti"

Quacksalver answered 30/9, 2015 at 9:5 Comment(0)

Try the quanteda package:

> quanteda::tokenize(txt, ngrams = 2, concatenator = " ")
[[1]]
 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"     
 [7] "vital can"        "can dog"          "dog food"         "food product"     "product found"    "found good"      
[13] "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci" 
[25] "appreci product"  "product better"

Plenty of additional arguments available through ngrams, including getting different combinations of n sizes, or skip-grams.

Multiphase answered 6/10, 2015 at 13:32 Comment(0)

Recommended topics

Hot tags