I have a PySpark DataFrame with a corpus of ~300k unique rows, each with a "doc" column containing a few sentences of text.
After processing, I have a 200-dimensional vector representation of each row/doc. My NLP process (sketched in code after the list):
- Punctuation removal with a regex UDF
- Word stemming with an NLTK Snowball stemmer UDF
- PySpark Tokenizer
- Word2Vec (ml.feature.Word2Vec, vectorSize=200, windowSize=5)
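To make the pipeline concrete, here is a stripped-down sketch of what I'm running (the toy data, column names, and the UDF internals are simplified stand-ins for my actual code, and `minCount=1` is only there so the toy example doesn't filter everything out):

```python
import re
from nltk.stem.snowball import SnowballStemmer
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.feature import Tokenizer, Word2Vec

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("the cats are running quickly!",), ("dogs ran over the fences.",)],
    ["doc"],
)

# 1. Remove punctuation with a regex UDF
strip_punct = udf(lambda s: re.sub(r"[^\w\s]", "", s), StringType())

# 2. Stem each whitespace-separated word with NLTK's Snowball stemmer
stemmer = SnowballStemmer("english")
stem_doc = udf(
    lambda s: " ".join(stemmer.stem(w) for w in s.split()), StringType()
)

clean_df = df.withColumn("clean", stem_doc(strip_punct("doc")))

# 3. Tokenize the cleaned text into an array of words
tokenizer = Tokenizer(inputCol="clean", outputCol="tokens")
tokenized_df = tokenizer.transform(clean_df)

# 4. Learn 200-dim word embeddings and get one vector per doc/row
word2vec = Word2Vec(
    vectorSize=200, windowSize=5, minCount=1,
    inputCol="tokens", outputCol="features",
)
model = word2vec.fit(tokenized_df)
result = model.transform(tokenized_df)  # one 200-dim vector per row
```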
I understand how this implementation uses the skip-gram model to create embeddings for each word based on the full corpus. My question is: how does this implementation go from a vector for each word in the corpus to a vector for each document/row?
Is it the same process as in the gensim doc2vec implementation, where it simply concatenates the word vectors in each doc together (see How does gensim calculate doc2vec paragraph vectors)? If so, how does it cut the vector down to the specified size of 200 (does it use just the first 200 words? an average?)?
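For concreteness, this is the comparison I'm trying to understand, using the `model` and `result` names from my sketch above:

```python
# Word-level output: one 200-dim vector per vocabulary word,
# learned over the whole corpus.
model.getVectors().show(5, truncate=False)

# Document-level output: transform() leaves one 200-dim vector per row,
# no matter how many words the doc contains. If the word vectors were
# simply concatenated, a 10-word doc would be 2,000 dims, so some
# reduction (first N words? an average?) must happen somewhere --
# that's the step I can't find.
result.select("features").show(5, truncate=False)
```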
I was unable to find this information in the source code: https://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/feature.html#Word2Vec
Any help or reference material to look at is super appreciated!