Spark HashingTF result explanation
I tried the standard Spark HashingTF example on Databricks.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
display(featurizedData)

I have difficulty understanding the result below, produced when numFeatures is 20:

[0,20,[0,5,9,17],[1,1,1,2]]
[0,20,[2,7,9,13,15],[1,1,3,1,1]]
[0,20,[4,6,13,15,18],[1,1,1,1,1]]

If [0,5,9,17] are hash values and [1,1,1,2] are frequencies, the numbers do not match the sentences: in the first row, index 17 has frequency 2, although every word in "Hi I heard about Spark" occurs only once; in the second row, index 9 has frequency 3 where I expected 2; and 13 and 15 have frequency 1 while I expected them to have 2.

Probably I am missing something; I could not find a detailed explanation in the documentation.

Morbilli answered 14/12, 2016 at 22:13 Comment(1)
Spark class HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing.Ambivalence
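The hashing trick described in the comment above can be sketched in a few lines of Python. This is a toy illustration only: it uses MD5 as a deterministic stand-in hash, whereas Spark's HashingTF actually uses MurmurHash3, so the indices it produces will not match Spark's output. What it does show is the mechanism: each word is hashed straight to a bucket index, counts are accumulated per bucket, and with only 20 buckets different words can land on the same index.

```python
import hashlib


def term_index(term: str, num_features: int) -> int:
    # Deterministic stand-in hash (MD5, truncated modulo the bucket count).
    # Spark's HashingTF uses MurmurHash3 instead, so indices will differ.
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def hashing_tf(words, num_features):
    # Map each word to a bucket and count occurrences per bucket,
    # exactly as the hashing trick does -- no global vocabulary needed.
    counts = {}
    for w in words:
        idx = term_index(w, num_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts


words = "i wish java could use case classes".split()

# With only 20 buckets, distinct words may share an index (a collision),
# which inflates some frequencies; with 20,000 buckets this is unlikely.
print(hashing_tf(words, 20))
print(hashing_tf(words, 20000))
```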

As mcelikkaya notes, the output frequencies are not what you would expect. This is due to hash collisions when the number of features is small (20 in this case). I added some words to the input data for illustration and raised numFeatures to 20,000, and then the correct frequencies are produced:

+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|label|sentence                                                 |words                                                                    |rawFeatures                                                                           |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|0    |Hi hi hi hi I i i i i heard heard heard about Spark Spark|[hi, hi, hi, hi, i, i, i, i, i, heard, heard, heard, about, spark, spark]|(20000,[3105,9357,11777,11960,15329],[2.0,3.0,1.0,4.0,5.0])                           |
|0    |I i wish Java could use case classes spark               |[i, i, wish, java, could, use, case, classes, spark]                     |(20000,[495,3105,3967,4489,15329,16213,16342,19809],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0])|
|1    |Logistic regression models are neat                      |[logistic, regression, models, are, neat]                                |(20000,[286,1193,9604,13138,18695],[1.0,1.0,1.0,1.0,1.0])                             |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
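Raising numFeatures works because collisions behave like the birthday problem: packing roughly a dozen distinct terms into 20 buckets almost guarantees a collision, while 20,000 buckets make one unlikely. A rough estimate, assuming a uniform hash (a back-of-the-envelope sketch, not part of Spark's API):

```python
def collision_prob(num_terms: int, num_buckets: int) -> float:
    # Birthday-problem estimate: probability that at least two of
    # num_terms distinct terms hash into the same bucket, assuming
    # the hash spreads terms uniformly over the buckets.
    p_no_collision = 1.0
    for k in range(num_terms):
        p_no_collision *= (num_buckets - k) / num_buckets
    return 1.0 - p_no_collision


# ~14 distinct terms into 20 buckets: a collision is almost certain.
print(collision_prob(14, 20))
# The same terms into 20,000 buckets: a collision is unlikely.
print(collision_prob(14, 20000))
```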
Social answered 22/7, 2017 at 6:35 Comment(0)

Your guesses are correct:

  • 20 is the vector size
  • the first list is a list of indices
  • the second list is a list of values

The leading 0 is just an artifact of the internal representation (it tags the vector as sparse rather than dense).

There is nothing more here to learn.
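To make the representation concrete, a sparse triple such as (20, [0,5,9,17], [1,1,1,2]) expands into a dense count vector. A plain Python sketch (not a Spark API; Spark's SparseVector does the equivalent internally):

```python
def to_dense(size, indices, values):
    # Expand a sparse-vector triple (size, indices, values) into a
    # dense list: position i holds the count stored for index i,
    # and every unlisted position is zero.
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The first row from the question: indices 0, 5, 9, 17 carry counts
# 1, 1, 1, 2, and the remaining 16 positions are zero.
print(to_dense(20, [0, 5, 9, 17], [1, 1, 1, 2]))
```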

Periotic answered 14/12, 2016 at 22:57 Comment(2)
Dear cricket, thanks for the answer. If the second list holds values (I assume frequencies), don't you see anything abnormal in those frequencies? How could 17 have frequency 2?Morbilli
@mcelikkaya, I asked this same question at Cross Validated and posted my findings in case it helps. Thanks!Rowell

© 2022 - 2024 — McMap. All rights reserved.