How are the number of iterations and the number of partitions related in Apache Spark Word2Vec?
According to the Spark 1.3.1 documentation for mllib.feature.Word2Vec [1]:

def setNumIterations(numIterations: Int): Word2Vec.this.type

Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

def setNumPartitions(numPartitions: Int): Word2Vec.this.type

Sets number of partitions (default: 1). Use a small number for accuracy.

But in this Pull Request [2]:

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
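That train-then-merge loop can be sketched as follows (a deliberately simplified model in plain Python, not Spark's actual code; the toy `train_partition` update and the plain averaging merge are illustrative assumptions):

```python
def train_partition(vectors, data, alpha):
    # Toy stand-in for per-partition SGD: nudge each word's vector toward
    # a data-dependent target. Real Word2Vec performs skip-gram updates.
    updated = dict(vectors)
    for word, target in data:
        v = updated[word]
        updated[word] = [x + alpha * (t - x) for x, t in zip(v, target)]
    return updated

def fit(corpus_partitions, vocab, num_iterations, alpha=0.025, dim=3):
    # Shared model: every iteration trains each partition independently
    # from the same starting point, then merges by averaging -- so more
    # partitions means each one contributes less data per merge.
    model = {w: [0.0] * dim for w in vocab}
    for _ in range(num_iterations):
        partials = [train_partition(model, part, alpha)
                    for part in corpus_partitions]
        model = {w: [sum(p[w][i] for p in partials) / len(partials)
                     for i in range(dim)]
                 for w in vocab}
    return model
```

Because each partition's update is diluted by the averaging merge, extra iterations are one way to recover accuracy, which matches the PR's reasoning.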

Questions:

  • How do the parameters numIterations & numPartitions affect the internal workings of the algorithm?

  • Is there a trade-off between the number of partitions and the number of iterations, considering the following rules?

    • more accuracy -> more iterations, according to [2]

    • more iterations -> more partitions, according to [1]

    • more partitions -> less accuracy

Lovieloving answered 2/6, 2016 at 4:53

When you increase the number of partitions, you decrease the amount of data each partition is trained on, making each training step (word-vector adjustment) noisier and less certain. Spark's implementation responds by decreasing the learning rate as the number of partitions grows, since more processes are updating the vector weights concurrently.
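As a rough numeric illustration of that effect (the function name and the linear decay formula here are assumptions for illustration only, not Spark's exact schedule):

```python
def effective_alpha(starting_alpha, words_seen, total_words,
                    num_partitions, num_iterations, floor_frac=0.0001):
    # Linear learning-rate decay whose slope is scaled by the number of
    # partitions: with more concurrent updaters, the rate drops faster,
    # so each individual update is smaller. Floored so it never reaches
    # zero. A simplified sketch, not Spark's internal formula.
    frac = (num_partitions * words_seen) / (num_iterations * total_words)
    return max(starting_alpha * (1 - frac), starting_alpha * floor_frac)
```

Halfway through the data (words_seen = 500 of 1000), a single partition still trains at alpha = 0.0125, while four partitions have already decayed to the floor; raising num_iterations stretches the schedule back out, which hints at why the two parameters are coupled.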

Sixpack answered 27/4, 2018 at 9:3 Comment(2)
@renner2 I still can't understand why numIterations <= numPartitions. How are they related? – Lovieloving
Looking at the Word2Vec code (github.com/apache/spark/blob/v2.2.1/mllib/src/main/scala/org/…), I can't find a good answer. This doesn't mean there is no theoretical reason for numIter <= numPart, just that I couldn't find it. In practice, when I have run word2vec with numIter > numPart, the word vector values are sometimes huge, indicating an exploding-gradient problem. If you look at the code, there is a comment about possibly discounting the learning rate by iteration, so maybe this problem will be solved in a future version (I'm using Spark 2.2.1) – Sixpack
