I tried the standard Spark HashingTF example on Databricks:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
// Three labelled sentences, as in the standard example
val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")
// Split each sentence into words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
// Hash the words into a term-frequency vector with 20 features
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
display(featurizedData)
I have difficulty understanding the result below, obtained when numFeatures is 20:
[0,20,[0,5,9,17],[1,1,1,2]]
[0,20,[2,7,9,13,15],[1,1,3,1,1]]
[0,20,[4,6,13,15,18],[1,1,1,1,1]]
If [0,5,9,17] are hash values
and [1,1,1,2] are frequencies, then:
17 has frequency 2,
9 has frequency 3 (I expected 2),
13 and 15 have frequency 1, while they should have 2.
I am probably missing something. I could not find documentation with a detailed explanation.
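To make the question concrete, here is a minimal plain-Scala sketch of the hashing trick as it is usually described: each token is hashed to an index in the range 0 until numFeatures, and tokens that land on the same index have their counts added together. This is only an illustration under stated assumptions, not Spark's implementation. The object HashingTrickSketch and its methods bucket and termFrequencies are hypothetical names, and String.hashCode stands in for whatever hash function Spark uses, so the indices it prints will not match the output above.

```scala
// Plain-Scala sketch of the hashing trick; no Spark required.
// Assumption: String.hashCode stands in for the hash function Spark uses,
// so the indices printed here will NOT match the indices in the output above.
object HashingTrickSketch {
  // Map a possibly negative hash onto the range 0 until numFeatures.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count tokens per bucket; tokens sharing a bucket are summed together.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Seq[(Int, Int)] =
    tokens
      .groupBy(t => bucket(t, numFeatures))
      .map { case (index, ts) => (index, ts.size) }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val sentence = "I wish Java could use case classes"
    // Lowercase and split on whitespace, as Tokenizer does.
    val tokens = sentence.toLowerCase.split("\\s+").toSeq
    val tf = termFrequencies(tokens, 20)
    println(tf)
    // The counts always add up to the number of tokens in the sentence.
    println(s"tokens = ${tokens.size}, sum of counts = ${tf.map(_._2).sum}")
  }
}
```

If this reading applies to HashingTF, the values in each row would be per-sentence counts per index rather than counts across sentences, and a value above 1 in a sentence with no repeated word would come from several different words sharing one index. The output above is at least consistent with that: the values in the three rows sum to 5, 7 and 5, which are the word counts of the three sentences.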