Multinomial Naive Bayes text classifier using a DataFrame in Scala Spark

I am trying to build a Naive Bayes classifier. I load the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (the label is multi-class):

+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|

I used the following transformations for tokenization, stop-word removal, n-grams, and hashing TF:

import org.apache.spark.ml.feature.{HashingTF, NGram, RegexTokenizer, StopWordsRemover, Tokenizer}

val selectedData = df.select("label", "feature")

// Tokenize the text column (regexTokenizer is an alternative that is defined but not used below)
val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
val regexTokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
val tokenized = tokenizer.transform(selectedData)
tokenized.select("words", "label").take(3).foreach(println)

// Remove stop words
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val parsedData = remover.transform(tokenized)

// Build n-grams (NGram defaults to n = 2, i.e. bigrams)
val ngram = new NGram().setInputCol("filtered").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(parsedData)
ngramDataFrame.take(3).map(_.getAs[Stream[String]]("ngrams").toList).foreach(println)

// Hash the n-grams into a term-frequency vector with 1000 buckets
val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("hash").setNumFeatures(1000)
val featurizedData = hashingTF.transform(ngramDataFrame)

Output of the transformation:

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             feature|               words|            filtered|              ngrams|                hash|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|combusting prepar...|[combusting, prep...|[combusting, prep...|[combusting prepa...|(1000,[124,161,69...|
|    1|adhesives for ind...|[adhesives, for, ...|[adhesives, indus...|[adhesives indust...|(1000,[451,604],[...|
|    1|                    |                  []|                  []|                  []|        (1000,[],[])|
|    1| salt for preserving|[salt, for, prese...|  [salt, preserving]|   [salt preserving]|  (1000,[675],[1.0])|
|    1|auxiliary fluids ...|[auxiliary, fluid...|[auxiliary, fluid...|[auxiliary fluids...|(1000,[661,696,89...|

To build a Naive Bayes model, I need to convert the label and feature columns into LabeledPoints. I tried the following approaches to convert the DataFrame into an RDD and create the LabeledPoints:

val rddData = featurizedData.select("label","hash").rdd

val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0), parts(1))
}


val rddData = featurizedData.select("label","hash").rdd.map(r =>   (Try(r(0).asInstanceOf[Integer]).get.toDouble,   Try(r(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble,   Vectors.dense(parts(1).split(',').map(_.toDouble)))
}

I am getting the following error:

 scala> val trainData = rddData.map { line =>
 |   val parts = line.split(',')
 |   LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
 | }
 <console>:67: error: value split is not a member of (Double,    org.apache.spark.mllib.linalg.SparseVector)
     val parts = line.split(',')
                      ^
<console>:68: error: not found: value Vectors
     LabeledPoint(parts(0).toDouble,   Vectors.dense(parts(1).split(',').map(_.toDouble)))

Edit 1:

Following the suggestion below, I created the LabeledPoints and trained the model.

val trainData = featurizedData.select("label","features")

val trainLabel = trainData.map(line => LabeledPoint(Try(line(0).asInstanceOf[Integer]).get.toDouble, Try(line(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

val splits = trainLabel.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabels = test.map { point => 
   val score = model.predict(point.features)
   (score, point.label)}

I am getting low accuracy, around 40%, both with and without n-grams and with different numbers of hash features. My dataset contains 5000 rows and 45 distinct labels. Is there any way to improve the model's performance? Thanks in advance.

You don't need to transform your featurizedData into an RDD. Apache Spark has two machine learning libraries, ML and MLlib: ML works with DataFrames, whereas MLlib works with RDDs. Since you already have a DataFrame, you can work with ML directly.

To do this, you just need to rename your columns to (label, features) and fit your model, as shown in the NaiveBayes documentation example below (in Python).

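from pyspark.sql import Row
from pyspark.ml.classification import NaiveBayes
# Spark 1.x import; on Spark 2.x and later use pyspark.ml.linalg instead
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(df)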
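To make the smoothing and modelType="multinomial" parameters a bit more concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive smoothing, trained on the same three rows. This is not Spark's implementation; the function names (train_multinomial_nb, predict) are made up for illustration, and smoothing both the priors and the per-feature likelihoods is an assumption of this sketch.

```python
import math
from collections import defaultdict

def train_multinomial_nb(rows, num_features, smoothing=1.0):
    """rows: list of (label, dense count vector).
    Returns (log_prior, log_likelihood), both keyed by label."""
    doc_counts = defaultdict(int)
    term_counts = defaultdict(lambda: [0.0] * num_features)
    for label, features in rows:
        doc_counts[label] += 1
        for j, x in enumerate(features):
            term_counts[label][j] += x

    n_docs, n_labels = len(rows), len(doc_counts)
    log_prior, log_likelihood = {}, {}
    for label, n in doc_counts.items():
        # Smoothed class prior (assumption of this sketch).
        log_prior[label] = math.log((n + smoothing) / (n_docs + smoothing * n_labels))
        # Smoothed per-feature likelihood: (count + s) / (total + s * num_features).
        total = sum(term_counts[label]) + smoothing * num_features
        log_likelihood[label] = [math.log((c + smoothing) / total)
                                 for c in term_counts[label]]
    return log_prior, log_likelihood

def predict(model, features):
    """Return the label with the highest log-posterior score."""
    log_prior, log_likelihood = model
    def score(label):
        return log_prior[label] + sum(
            x * w for x, w in zip(features, log_likelihood[label]))
    return max(log_prior, key=score)

# Same toy rows as the DataFrame example above.
rows = [(0.0, [0.0, 0.0]), (0.0, [0.0, 1.0]), (1.0, [1.0, 0.0])]
model = train_multinomial_nb(rows, num_features=2, smoothing=1.0)
print(predict(model, [1.0, 0.0]), predict(model, [0.0, 1.0]))
```

With smoothing set to 0, a feature never seen for a class would get probability zero (log of zero), which is why additive smoothing is normally kept above 0 for sparse text features.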

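The first error occurs because each element of rddData is already a (Double, SparseVector) tuple, not a String, so there is no split method to call and nothing left to parse. Your RDD already has almost the structure you need: you only have to convert each tuple to a LabeledPoint, e.g. rddData.map { case (label, vector) => LabeledPoint(label, vector) }. The second error (not found: value Vectors) just means that org.apache.spark.mllib.linalg.Vectors was not imported.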

There are a few ways to improve performance. The first that comes to mind is removing stop words (e.g. the, a, an, to, although). The second is to count the distinct words in your texts and build the vectors yourself, with one index per word. The reason is that when the number of hash features is low, different words can map to the same index, which hurts accuracy.
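The collision point can be illustrated without Spark. The sketch below is pure Python and makes several assumptions: CRC32 stands in for whatever hash function the real HashingTF uses, the three tiny documents are invented, and all function names are hypothetical. It compares hashed indices against a vocabulary that gives each distinct term its own index.

```python
import zlib

def hashed_index(term, num_features):
    """Bucket for a term under the hashing trick (CRC32 is a stand-in hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashing_tf(terms, num_features):
    """Sparse term-frequency vector {bucket: count}; colliding terms are merged."""
    vec = {}
    for t in terms:
        i = hashed_index(t, num_features)
        vec[i] = vec.get(i, 0.0) + 1.0
    return vec

def build_vocabulary(docs):
    """Give every distinct term its own index, in order of first appearance."""
    vocab = {}
    for terms in docs:
        for t in terms:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab

def count_tf(terms, vocab):
    """Sparse term-frequency vector {index: count} with no collisions."""
    vec = {}
    for t in terms:
        if t in vocab:
            vec[vocab[t]] = vec.get(vocab[t], 0.0) + 1.0
    return vec

def count_collisions(vocab, num_features):
    """Number of distinct terms that end up sharing a bucket with another term."""
    used = {hashed_index(t, num_features) for t in vocab}
    return len(vocab) - len(used)

# Invented toy documents, already tokenized.
docs = [["salt", "preserving"], ["salt", "water"], ["glue", "paper"]]
vocab = build_vocabulary(docs)

for n in (2, 1000):
    print(n, "buckets ->", count_collisions(vocab, n), "collisions")
print(count_tf(["salt", "salt", "water"], vocab))
```

If I remember correctly, Spark's ML package also has a CountVectorizer that builds such a vocabulary for you, so it may be worth checking before constructing the vectors by hand. Simply raising setNumFeatures well above the number of distinct terms or n-grams should also reduce collisions.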
