How to generate n-grams in Scala?

Asked 24/11, 2011 at 14:55 Answered 17/12, 2013 at 12:48

I am trying to implement the dissociated press algorithm, based on n-grams, in Scala. How can I generate n-grams for a large file? For example, take a file containing "the bee is the bee of the bees".

First, it has to pick a random n-gram, for example "the bee".
Then it has to look for an n-gram starting with the last (n-1) words of the current one, for example "bee of".
It prints the last word of this n-gram, then repeats.
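
The steps above can be sketched roughly as follows. This is only one possible reading of the algorithm, not a reference implementation: the names `DissociatedPress`, `followers`, `generate` and the `maxWords` cut-off are assumptions made for illustration, and the generator simply stops when no n-gram starts with the current (n-1) words.

```scala
import scala.util.Random

object DissociatedPress {
  // Table from each (n-1)-word prefix to the words that follow it in the text.
  def followers(words: Vector[String], n: Int): Map[Vector[String], Vector[String]] =
    words.sliding(n).filter(_.size == n).toVector
      .groupBy(_.init)
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  // Start from a random n-gram, then repeatedly pick an n-gram that starts
  // with the last n-1 words and append its last word (maxWords is an assumed limit).
  def generate(words: Vector[String], n: Int, maxWords: Int, rnd: Random): Vector[String] = {
    require(n >= 2, "n must be at least 2")
    val grams = words.sliding(n).filter(_.size == n).toVector
    if (grams.isEmpty) Vector.empty
    else {
      val table = followers(words, n)
      var out = grams(rnd.nextInt(grams.size))
      var stuck = false
      while (out.size < maxWords && !stuck) {
        table.get(out.takeRight(n - 1)) match {
          case Some(next) => out = out :+ next(rnd.nextInt(next.size))
          case None       => stuck = true // no n-gram starts with this prefix
        }
      }
      out.take(maxWords)
    }
  }
}
```

It could be called as `DissociatedPress.generate("the bee is the bee of the bees".split(" ").toVector, 2, 20, new Random())`; the result differs from run to run because the choices are random.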

Can you please give me some hints on how to do it? Sorry for the inconvenience.

Boice answered 24/11, 2011 at 14:55 Comment(3)

I don't know what an n-gram is. Are you just choosing words randomly, or is there some logic? – Twoway 24/11, 2011 at 15:1

@Twoway Wikipedia is your friend: en.wikipedia.org/wiki/N-gram – Titled 24/11, 2011 at 15:2

Is this by any chance related to #8257330? – Titled 24/11, 2011 at 15:51

Your question could be a little more specific, but here is my try.

val words = "the bee is the bee of the bees"
// sliding(2) yields every pair of adjacent words (2-grams)
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))

Rodmun answered 24/11, 2011 at 15:8 Comment(2)

Note that this will only give you 2-grams. If n-grams are desired, then n needs to be parameterized. – Fridafriday 17/12, 2013 at 12:50

@Fridafriday but it can be easily adjusted – Umbilicus 20/5, 2020 at 18:36
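
As a sketch of the adjustment discussed in these comments, the window size can be passed in as a parameter. The names `NGrams` and `ngrams` are made up for illustration, and dropping windows shorter than n is a design assumption (by default `sliding` returns one short window when the input has fewer than n words).

```scala
object NGrams {
  // Returns all n-grams of exactly n words; a text shorter than n yields nothing.
  def ngrams(text: String, n: Int): List[List[String]] = {
    require(n >= 1, "n must be positive")
    text.split("\\s+").filter(_.nonEmpty).toList
      .sliding(n)
      .filter(_.size == n) // drop a lone window shorter than n
      .toList
  }
}
```

For example, `NGrams.ngrams("the bee is the bee of the bees", 3)` starts with `List(the, bee, is)`.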

You may try this with a parameter n:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
val ngrams = (for( i <- 1 to n) yield w.sliding(i).map(p => p.toList)).flatMap(x => x)
ngrams foreach println

Output:

List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)

Fridafriday answered 24/5, 2013 at 9:58 Comment(0)

Here is a stream-based approach. It should not require too much memory while computing the n-grams.

object ngramstream extends App {

  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

OUTPUT:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)

Fridafriday answered 17/12, 2013 at 12:48 Comment(2)

I like it, not sure of the usefulness of process. Why not just do ngrams(...).foreach(x=>println(x.toList))? – Alicea 18/3, 2014 at 13:51

@Mortimer: Interesting question. process is just an additional function. We can definitely use ngrams2 foreach { x => println(x.toList)}. Thanks :-) – Fridafriday 19/3, 2014 at 11:57
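
For the large files mentioned in the question, a plain `Iterator` may be worth considering as well: a `Stream` memoizes the elements it has already evaluated, so keeping a reference to its head can retain them in memory, whereas an iterator hands out each window once. The sketch below is an assumption-laden illustration, not part of the answer above: `LazyNGrams`, `printFromFile` and the `path` parameter are hypothetical names, and n-grams are allowed to span line breaks.

```scala
import scala.io.Source

object LazyNGrams {
  // Turns an iterator of lines into an iterator of n-grams without
  // building the whole word list first; n-grams may span line breaks.
  def ngrams(lines: Iterator[String], n: Int): Iterator[Seq[String]] =
    lines
      .flatMap(line => line.split("\\s+").iterator)
      .filter(_.nonEmpty)
      .sliding(n)
      .withPartial(false) // drop a trailing window shorter than n
      .map(_.toList)

  // Hypothetical helper: `path` is a placeholder for your own file.
  def printFromFile(path: String, n: Int): Unit = {
    val source = Source.fromFile(path)
    try ngrams(source.getLines(), n).foreach(g => println(g.mkString(" ")))
    finally source.close()
  }
}
```

Because the function takes any `Iterator[String]`, it can be tried without a file, for example `LazyNGrams.ngrams(Iterator("the bee is", "the bee of the bees"), 2).foreach(println)`.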
