How to create a bigram from a text file with frequency count in Spark/Scala?

I want to take a text file and build bigrams from all adjacent words that are not separated by a dot ".", removing any special characters. I'm trying to do this using Spark and Scala.

This text:

Hello my Friend. How are
you today? bye my friend.

Should produce the following:

hello my, 1
my friend, 2
how are, 1
you today, 1
today bye, 1
bye my, 1

Cupping answered 18/4, 2015 at 3:28 Comment(0)

For each line in the RDD, start by splitting on '.'. Then tokenize each resulting substring by splitting on ' '. Once tokenized, remove special characters with replaceAll and convert to lowercase. Each of these token arrays can then be turned with sliding into an iterator of two-element string arrays, the bigrams.

Then, after flattening and converting the bigram arrays to strings with mkString as requested, get a count for each one with groupBy and mapValues.

Finally flatten, reduce, and collect the (bigram, count) tuples from the RDD.
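
To see what the individual pieces do before they are chained on the RDD, here is a minimal local (non-Spark) sketch of sliding and groupBy/mapValues on a single token array taken from the example text; the full code below applies the same calls to each line.

// Local sketch only: tokens from one sentence of the example
val tokens = Array("hello", "my", "friend")

// sliding(2) yields Array("hello", "my"), Array("my", "friend")
val bigrams = tokens.sliding(2).map(_.mkString(" ")).toList
// List("hello my", "my friend")

// groupBy/mapValues turns the list into bigram -> count pairs
val counts = bigrams.groupBy(identity).mapValues(_.size)
// Map("hello my" -> 1, "my friend" -> 1)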

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))

rdd.map{ 

    // Split each line into substrings by periods
    _.split('.').map{ substring =>

        // Trim the substring and then tokenize on spaces
        substring.trim.split(' ').

        // Remove non-alphanumeric characters, using Shyamendra's
        // clean replacement technique, and convert to lowercase
        map{_.replaceAll("""\W""", "").toLowerCase()}.

        // Find bigrams
        sliding(2)
    }.

    // Flatten, and map the bigrams to concatenated strings
    flatMap{identity}.map{_.mkString(" ")}.

    // Group the bigrams and count their frequency
    groupBy{identity}.mapValues{_.size}

}.

// Reduce to get a global count, then collect
flatMap{identity}.reduceByKey(_+_).collect.

// Format and print
foreach{x=> println(x._1 + ", " + x._2)}

you today, 1
hello my, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1    
Blackcap answered 18/4, 2015 at 4:0 Comment(6)
Thanks @ohruunuruus. When I try to output this to a file I get: '[Ljava.lang.String;@7358dbec [Ljava.lang.String;@4ece9e1d [Ljava.lang.String;@6f124cb [Ljava.lang.String;@41a68efc [Ljava.lang.String;@1df56410 [Ljava.lang.String;@5800bbcf [Ljava.lang.String;@7ddb1518 [Ljava.lang.String;@3a461b35 ' Any pointers? – Cupping
@Cupping I've got a workaround for that now, though I'm not certain this is the best way to do it. – Blackcap
Would it be easy to just concatenate the pairs? – Cupping
I'm trying to run it, but I get this error: [error] found : Array[String] => Array[String] [error] required: Array[String] => scala.collection.GenTraversableOnce[?] [error] _.trim.split(' ').sliding(2).flatMap(identity).map{_.mkString(" ")} – Cupping
We can continue this discussion in chat if you're interested. – Blackcap
@Cupping Not sure if you're still working on this, but I believe all the kinks are worked out now. – Blackcap
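
On the file-output question raised above: the '[Ljava.lang.String;@...' text is just the default toString of an Array, which is what gets written if the RDD still contains arrays (for example the raw sliding windows) when it is saved. A minimal sketch of one workaround, assuming bigramCounts names the (bigram, count) RDD produced by the chain above just before collect, and using a placeholder output path:

// Sketch: format each (bigram, count) tuple as a plain string before writing,
// so saveAsTextFile emits readable lines rather than tuple/array toString output.
bigramCounts
  .map{ case (bigram, count) => s"$bigram, $count" }
  .saveAsTextFile("output/bigrams")   // "output/bigrams" is a placeholder path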

To separate whole words from any punctuation marks, consider for instance

val words = text.split("\\W+")

which in this case delivers

Array[String] = Array(Hello, my, Friend, How, are, you, today, bye, my, friend)

Pairing words into tuples is more in line with the concept of a bigram, so consider for instance

for( Array(a,b,_*) <- words.sliding(2).toArray ) 
yield (a.toLowerCase(), b.toLowerCase())

which yields

Array((hello,my), (my,friend), (friend,how), (how,are), 
      (are,you), (you,today), (today,bye), (bye,my), (my,friend))

The answer by ohruunuruus, on the other hand, offers a concise approach.
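
As the comment below notes, sliding over the whole word list also pairs words across the sentence boundary, producing (friend,how), which the question excludes. A minimal sketch of one way to combine this tuple approach with a prior split on '.', so that no pair crosses a sentence (plain Scala, assuming the same text value as above):

// Sketch: split into sentences first, then build (word, word) tuples
// within each sentence only, so no bigram crosses a '.' boundary.
val sentenceBigrams = text.split("""\.""").flatMap { sentence =>
  val words = sentence.split("\\W+").filter(_.nonEmpty).map(_.toLowerCase)
  for (Array(a, b, _*) <- words.sliding(2).toArray) yield (a, b)
}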

Theism answered 18/4, 2015 at 6:26 Comment(1)
Nice work on the split, though this will create a bigram for "friend how" which OP doesn't want. You're totally right about bigrams and tuples. – Blackcap

This should work in Spark:

def bigramsInString(s: String): Array[((String, String), Int)] = { 

    s.split("""\.""")                        // split on .
     .map(_.split(" ")                       // split on space
           .filter(_.nonEmpty)               // remove empty string
           .map(_.replaceAll("""\W""", "")   // remove special chars
                 .toLowerCase)
           .filter(_.nonEmpty)                
           .sliding(2)                       // take continuous pairs
           .filter(_.size == 2)              // sliding can return partial
           .map{ case Array(a, b) => ((a, b), 1) })
     .flatMap(x => x)                         
}

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))

rdd.map(bigramsInString)
   .flatMap(x => x)             
   .countByKey                   // get result in driver memory as Map
   .foreach{ case ((x, y), z) => println(s"${x} ${y}, ${z}") }

// my friend, 2
// how are, 1
// today bye, 1
// bye my, 1
// you today, 1
// hello my, 1
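
As noted in the code comment, countByKey returns the whole result to the driver as a Map, which is fine for this small example. For larger inputs, here is a minimal sketch of a distributed alternative that keeps the aggregation in an RDD with reduceByKey, reusing the same bigramsInString helper:

// Sketch: aggregate on the cluster with reduceByKey instead of
// collecting per-key counts to the driver with countByKey.
val bigramCounts = rdd
  .map(bigramsInString)   // one array of ((word1, word2), 1) pairs per line
  .flatMap(x => x)        // flatten to a single RDD of pairs
  .reduceByKey(_ + _)     // sum counts per bigram across partitions

// Collecting here is safe only because the result is small.
bigramCounts.collect.foreach { case ((a, b), n) => println(s"$a $b, $n") }
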
Longobard answered 18/4, 2015 at 6:29 Comment(2)
I'm getting this error: scala.MatchError: [Ljava.lang.String;@3b2c2ef (of class [Ljava.lang.String;) in the line: ".map{ case Array(a, b) => ((a, b), 1) })" – Cupping
@Cupping I see: sliding can return a partial array at the end when there are fewer elements than the window size. Updated the answer. – Longobard
