N-gram generation from a sentence
Asked Answered
D

7

33

How to generate an n-gram of a string like:

String Input="This is my car."

I want to generate n-gram with this input:

Input Ngram size = 3

Output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

Give some idea in Java, how to implement that or if any library is available for it.

I am trying to use this NGramTokenizer but its giving n-gram's of character sequence and I want n-grams of word sequence.

Dissuade answered 7/9, 2010 at 7:53 Comment(0)
P
26

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. This class may be in different package in newer version of Lucene.

Poteat answered 7/9, 2010 at 12:53 Comment(0)
A
45

I believe this would do what you want:

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

Output:

This
is
my
car.

This is
is my
my car.

This is my
is my car.

An "on-demand" solution implemented as an Iterator:

class NgramIterator implements Iterator<String> {

    String[] words;
    int pos = 0, n;

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

    public String next() {
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append((i > pos ? " " : "") + words[i]);
        pos++;
        return sb.toString();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
Altarpiece answered 7/9, 2010 at 8:3 Comment(0)
P
26

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. This class may be in different package in newer version of Lucene.

Poteat answered 7/9, 2010 at 12:53 Comment(0)
D
6

This code returns an array of all Strings of the given length:

public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    String[] result = new String[parts.length - len + 1];
    for(int i = 0; i < parts.length - len + 1; i++) {
       StringBuilder sb = new StringBuilder();
       for(int k = 0; k < len; k++) {
           if(k > 0) sb.append(' ');
           sb.append(parts[i+k]);
       }
       result[i] = sb.toString();
    }
    return result;
}

E.g.

System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car] 
Deathly answered 7/9, 2010 at 8:6 Comment(3)
ngrams("This is my car", -3) (sorry, couldn't resist)Betimes
ngrams("This is my car", -3) works fine. ngrams("This is my car", 6) however, results in a NegativeArraySizeException.Altarpiece
What do you expect in these cases? I'd suggest to put a test at the beginning of the method and return an empty array. Generally I see few SO answers with a sophisticated error handling.Deathly
A
1
/**
 * 
 * @param sentence should has at least one string
 * @param maxGramSize should be 1 at least
 * @return set of continuous word n-grams up to maxGramSize from the sentence
 */
public static List<String> generateNgramsUpto(String str, int maxGramSize) {

    List<String> sentence = Arrays.asList(str.split("[\\W+]"));

    List<String> ngrams = new ArrayList<String>();
    int ngramSize = 0;
    StringBuilder sb = null;

    //sentence becomes ngrams
    for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
        String word = (String) it.next();

        //1- add the word itself
        sb = new StringBuilder(word);
        ngrams.add(word);
        ngramSize=1;
        it.previous();

        //2- insert prevs of the word and add those too
        while(it.hasPrevious() && ngramSize<maxGramSize){
            sb.insert(0,' ');
            sb.insert(0,it.previous());
            ngrams.add(sb.toString());
            ngramSize++;
        }

        //go back to initial position
        while(ngramSize>0){
            ngramSize--;
            it.next();
        }                   
    }
    return ngrams;
}

Call:

long startTime = System.currentTimeMillis();
ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
System.out.println(ngrams.toString());

Output:

My time = 1 ms with ngramsize = 9 [This, is, This is, my, is my, This is my, car, my car, is my car]

Avow answered 20/12, 2012 at 17:25 Comment(0)
K
1
    public static void CreateNgram(ArrayList<String> list, int cutoff) {
    try
    {
        NGramModel ngramModel = new NGramModel();
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        perfMon.start();
        for(int i = 0; i<list.size(); i++)
        {
            String inputString = list.get(i);
            ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
            String line;
            while ((line = lineStream.read()) != null) 
            {
                String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] tags = tagger.tag(whitespaceTokenizerLine);

                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);

                perfMon.incrementCounter();

                String words[] = sample.getSentence();

                if(words.length > 0)
                {
                    for(int k = 2; k< 4; k++)
                    {
                        ngramModel.add(new StringList(words), k, k);
                    }
                }
            }
        }
        ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
        Iterator<StringList> it = ngramModel.iterator();
        while(it.hasNext())
        {
            StringList strList = it.next();
            System.out.println(strList.toString());
        }
        perfMon.stopAndPrintFinalResult();
    }catch(Exception e)
    {
        System.out.println(e.toString());
    }
}

Here is my codes to create n-gram. In this case, n = 2, 3. n-gram of words sequence which smaller than cutoff value will ignore from result set. Input is list of sentences, then it parse using a tool of OpenNLP

Kane answered 14/5, 2013 at 2:8 Comment(0)
N
0
public static void main(String[] args) {

    String[] words = "This is my car.".split(" ");
    for (int n = 0; n < 3; n++) {

        List<String> list = ngrams(n, words);
        for (String ngram : list) {
            System.out.println(ngram);
        }
        System.out.println();

    }
}

public static List<String> ngrams(int stepSize, String[] words) {
    List<String> ngrams = new ArrayList<String>();
    for (int i = 0; i < words.length-stepSize; i++) {

        String initialWord = "";
        int internalCount = i;
        int internalStepSize = i + stepSize;
        while (internalCount <= internalStepSize
                && internalCount < words.length) {
            initialWord = initialWord+" " + words[internalCount];
            ++internalCount;
        }
        ngrams.add(initialWord);

    }
    return ngrams;
}
Niela answered 18/7, 2017 at 4:6 Comment(0)
I
0

Check this out:

public static void main(String[] args) {
    NGram nGram = new NGram();
    String[] tokens = "this is my car".split(" ");
    int i = tokens.length;
    List<String> ngrams = new ArrayList<>();
    while (i >= 1){
        ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
        i--;
    }
    System.out.println(ngrams);
}

private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
    StringBuilder strbldr = new StringBuilder();
    if (tokens.length < n) {
        return ngrams;
    }else {
        for (int i=0; i<n; i++){
            strbldr.append(tokens[i]).append(" ");
        }
        ngrams.add(strbldr.toString().trim());
        String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
        return getNGram(newTokens, n, ngrams);
    }
}

Simple recursive function, better running time.

Interrupt answered 30/1, 2019 at 8:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.