How does PySpark calculate Doc2Vec from Word2Vec word embeddings?
I have a PySpark dataframe with a corpus of ~300k unique rows, each with a "doc" containing a few sentences of text.

After processing, I have a 200-dimensional vector representation of each row/doc. My NLP process (sketched in code after the list):

  1. Remove punctuation with a regex UDF
  2. Word stemming with an NLTK Snowball UDF
  3. PySpark Tokenizer
  4. Word2Vec (ml.feature.Word2Vec, vectorSize=200, windowSize=5)
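
For concreteness, the pipeline looks roughly like this (a sketch, not my exact code; `df` with a `doc` string column, the column names, and the UDF bodies are assumptions):

```python
import re
from nltk.stem.snowball import SnowballStemmer
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Steps 1-2: punctuation removal and stemming as UDFs (names are assumptions).
strip_punct = udf(lambda s: re.sub(r"[^\w\s]", "", s), StringType())
stemmer = SnowballStemmer("english")
stem = udf(lambda s: " ".join(stemmer.stem(w) for w in s.split()), StringType())

df = df.withColumn("cleaned", strip_punct("doc")).withColumn("stemmed", stem("cleaned"))

# Step 3: tokenize into a words column.
tokens = Tokenizer(inputCol="stemmed", outputCol="tokens").transform(df)

# Step 4: fit Word2Vec; transform() yields one 200-dim vector per row/doc.
w2v = Word2Vec(vectorSize=200, windowSize=5, inputCol="tokens", outputCol="doc_vec")
vectors = w2v.fit(tokens).transform(tokens)
```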

I understand how this implementation uses the skip-gram model to create embeddings for each word based on the full corpus. My question is: how does this implementation go from a vector for each word in the corpus to a vector for each document/row?

Is it the same process as in the gensim doc2vec implementation, where it simply concatenates the word vectors in each doc together (see: How does gensim calculate doc2vec paragraph vectors)? If so, how does it cut the vector down to the specified size of 200? Does it use just the first 200 words, or average them?

I was unable to find the information in the source code: https://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/feature.html#Word2Vec

Any help or reference material to look at is super appreciated!

Preference answered 2/1, 2018 at 16:20

One simple way to go from word-vectors to a single vector for a range of text is to average the vectors together, and that often works well enough for some tasks.
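
For example, a tiny illustration of that averaging (the word vectors here are made-up 4-dimensional toys; a trained model would supply real ones):

```python
import numpy as np

# Made-up word vectors; a trained Word2Vec model would supply real ones.
word_vecs = {
    "spark": np.array([0.1, 0.3, -0.2, 0.5]),
    "is":    np.array([0.0, 0.1,  0.0, 0.1]),
    "fast":  np.array([0.4, -0.1, 0.2, 0.3]),
}

doc = ["spark", "is", "fast"]
doc_vec = np.mean([word_vecs[w] for w in doc], axis=0)  # element-wise mean
```

The document vector has the same dimensionality as the word vectors, so no truncation to "the first 200 words" is ever needed.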

However, that's not how the Doc2Vec class in gensim does it. That class implements the 'Paragraph Vectors' technique, where separate document-vectors are trained in a manner analogous to word-vectors.

The doc-vectors participate in training a bit like a floating synthetic word, involved in every sliding-window/target-word prediction. They're not composed or concatenated from preexisting word-vectors, though in some modes they may be trained simultaneously alongside word-vectors. (However, the fast and often top-performing PV-DBOW mode, enabled in gensim with the parameter dm=0, doesn't train or use input word-vectors at all. It just trains doc-vectors that are good for predicting the words in each text example.)
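
In gensim, that mode looks roughly like this (a minimal sketch with an invented toy corpus; `model.dv` is the gensim 4.x accessor, older releases call it `docvecs`):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; each document gets a tag identifying its trained vector.
docs = [
    TaggedDocument(words=["spark", "is", "fast"], tags=["doc0"]),
    TaggedDocument(words=["gensim", "trains", "paragraph", "vectors"], tags=["doc1"]),
]

# dm=0 selects PV-DBOW: doc-vectors are trained to predict each document's
# words directly, without training or using input word-vectors.
model = Doc2Vec(docs, dm=0, vector_size=200, min_count=1, epochs=20)

vec = model.dv["doc0"]  # the trained 200-dimensional doc-vector
```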

Since you've mentioned multiple libraries (both Spark MLlib and gensim) but haven't shown your code, it's not certain exactly what your existing process is doing.

Euchromosome answered 3/1, 2018 at 15:06
I'm exclusively using pyspark.ml (feature and functions) under the premise that it will be faster than gensim. I found more details in the Scala source code from another response here: github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/… It looks like the 'average the vectors together' approach. Thanks for the information! – Preference
The Paragraph Vectors article describes a two-step estimator: first get coefficients for the word vectors, then fix them across all paragraphs and do the higher-level paragraph calculations. Since no two paragraphs are identical, the math requires some meta magic, similar to hierarchical Bayesian regression (imagine having characteristics of paragraphs that are abstract enough to allow you to characterize a new paragraph). I don't think it is literally an average; the Paragraph Vector article says it is not. – Emmet
No, the process in the original 'Paragraph Vector' paper does not describe a two-step process which first calculates "coefficients for the word-vectors" and then later does "higher-level" paragraph calculations. The paragraph-vectors are trained simultaneously with any word-vectors, in a manner very highly analogous to word2vec training, without any "meta magic". There's nothing in the paper's description or typical implementations which enforces any "no two paragraphs are identical" constraint. – Euchromosome

In PySpark, ml.feature.Word2Vec produces the so-called doc2vec by averaging the word vectors across each doc, which amounts to a term-frequency (TF) weighted average of the distinct word vectors. You can study the result of the official example at https://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/feature.html#Word2Vec
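
A minimal way to see this (adapted from that docstring example; an active SparkSession named `spark` is assumed, and minCount=1 keeps the toy vocabulary from being filtered out):

```python
from pyspark.ml.feature import Word2Vec

# Two toy documents; note "a" appears twice in the second one.
doc = spark.createDataFrame([("a b c".split(" "),), ("a a b".split(" "),)], ["sentence"])

# minCount=1 keeps all words in this tiny vocabulary (the default is 5).
w2v = Word2Vec(vectorSize=3, minCount=1, seed=42, inputCol="sentence", outputCol="doc_vec")
model = w2v.fit(doc)

model.getVectors().show()                  # the learned per-word embeddings
model.transform(doc).show(truncate=False)  # one averaged vector per document
```

Because "a" occurs twice in the second document, its embedding contributes twice to that document's average, which is what makes the result a TF-weighted mean of the distinct word vectors.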

Wafd answered 28/3, 2019 at 2:48
