What algorithm I need to find n-grams?
Asked Answered
P

7

14

What algorithm is used for finding ngrams?

Supposing my input data is an array of words and the size of the ngrams I want to find, what algorithm I should use?

I'm asking for code, with preference for R. The data is stored in database, so can be a plgpsql function too. Java is a language I know better, so I can "translate" it to another language.

I'm not lazy, I'm only asking for code because I don't want to reinvent the wheel trying to do an algorithm that is already done.

Edit: it's important know how many times each n-gram appears.

Edit 2: there is a R package for N-GRAMS?

Palmira answered 17/11, 2011 at 1:53 Comment(3)
there's a text mining package (tm) and a textcat package ... library("sos"); findFn("n-gram")Scarcely
related/near-duplilcate questionStaphylo
Look at cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdfCrossman
S
24

If you want to use R to identify ngrams, you can use the tm package and the RWeka package. It will tell you how many times the ngram occurs in your documents, like so:

  library("RWeka")
  library("tm")

  data("crude")

  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

  inspect(tdm[340:345,1:10])

A term-document matrix (6 terms, 10 documents)

Non-/sparse entries: 4/56
Sparsity           : 93%
Maximal term length: 13 
Weighting          : term frequency (tf)

               Docs
Terms           127 144 191 194 211 236 237 242 246 248
  and said        0   0   0   0   0   0   0   0   0   0
  and security    0   0   0   0   0   0   0   0   1   0
  and set         0   1   0   0   0   0   0   0   0   0
  and six-month   0   0   0   0   0   0   0   1   0   0
  and some        0   0   0   0   0   0   0   0   0   0
  and stabilise   0   0   0   0   0   0   0   0   0   1

hat-tip: http://tm.r-forge.r-project.org/faq.html

Staphylo answered 12/4, 2013 at 17:19 Comment(0)
P
8

For anyone still interested in this topic, there is a package on the cran already.

ngram: An n-gram Babbler

This package offers utilities for creating, displaying, and "babbling" n-grams. The babbler is a simple Markov process.

http://cran.r-project.org/web/packages/ngram/index.html

Prober answered 8/10, 2014 at 7:33 Comment(0)
B
3

Usually the n-grams are calculated to find its frequency distribution. So Yes, it does matter how many times the n-grams appear.

Also you want character level n-gram or word level n-gram. I have written a code for finding the character level n-gram from a csv file in r. I used package 'tau' for that. You can find it here.

Also here is the code I wrote:

 library(tau)
temp<-read.csv("/home/aravi/Documents/sample/csv/ex.csv",header=FALSE,stringsAsFactors=F)
r<-textcnt(temp, method="ngram",n=4L,split = "[[:space:][:punct:]]+", decreasing=TRUE)
a<-data.frame(counts = unclass(r), size = nchar(names(r)))
b<-split(a,a$size)
b

Cheers!

Bishop answered 19/7, 2013 at 13:55 Comment(0)
F
1

EDIT: Sorry, this is PHP. I wasn't quite sure what you wanted. I don't know it in java but perhaps the following could be converted easily enough.

Well it depends on the size of the ngrams you want.

I've had quite a lot of success with single letters (especially accurate for language detection), which is easy to get with:

$letters=str_split(preg_replace('/[^a-z]/', '', strtolower($text)));
$letters=array_count_values($letters);

Then there is the following function for calculating ngrams from a word:

function getNgrams($word, $n = 3) {
        $ngrams = array();
        $len = strlen($word);
        for($i = 0; $i < $len; $i++) {
                if($i > ($n - 2)) {
                        $ng = '';
                        for($j = $n-1; $j >= 0; $j--) {
                                $ng .= $word[$i-$j];
                        }
                        $ngrams[] = $ng;
                }
        }
        return $ngrams;
}

The source of the above is here, which I recommend you read, and they have lots of functions to do exactly what you want.

Foveola answered 17/11, 2011 at 1:57 Comment(0)
B
0

You can use ngram package. One example of its usage is http://amunategui.github.io/speak-like-a-doctor/

Bernardinebernardo answered 31/3, 2016 at 16:13 Comment(1)
Hello, welcome to SO. This answer relies almost completely on external links. Should they ever become invalid, your answer would become useless. So please edit it and add at least a summary of what can be found there. Thank you!Mcintire
C
0

Have a look at https://cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdf

Here is a quick example. It's quite fast look at the benchmark of the vignette.

require(ngram)

"hi i am ig" %>% ngram(n = 2) %>% get.ngrams()
Crossman answered 26/2, 2017 at 15:48 Comment(0)
I
0

Simple heres the java answer:

int ngrams = 9;// let's say 9-grams since it's the length of "bonasuera"... 
String string = "bonasuera";
for (int j=1; j <= ngrams;j++) {    
    for (int k=0; k < string.length()-j+1;k++ )
        System.out.print(string.substring(k,k+j) + " ");
    System.out.println();
}

output :

b o n a s u e r a 
bo on na as su ue er ra 
bon ona nas asu sue uer era 
bona onas nasu asue suer uera 
bonas onasu nasue asuer suera 
bonasu onasue nasuer asuera 
bonasue onasuer nasuera 
bonasuer onasuera 
bonasuera 
Inflect answered 29/11, 2018 at 16:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.