NLP: Pre-processing in doc2vec / word2vec
A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:

The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag

(http://www.ep.liu.se/ecp/131/039/ecp17131039.pdf)

For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP

(https://arxiv.org/pdf/1607.05368.pdf)

So my questions are:

  • Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model (see the sketch at the end of this question)? Or are the tags used to filter tokens? For example, gensim's WikiCorpus applies lemmatization by default and then keeps only a few part-of-speech types (verbs, nouns, etc.), discarding the rest. So what is the recommended way?

  • The quote from the second paper reads to me as if they only split up the words and then lowercase them. This is also what I first tried before I used WikiCorpus. In my opinion, this should give better results for document embeddings, as most POS types contribute to the meaning of a sentence. Am I right?

In the original doc2vec paper I did not find details about their pre-processing.
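For illustration, this is roughly what I have in mind for the {lemma}_{POS} replacement from the first bullet. It uses stanza, the Stanford NLP group's Python package, as a stand-in for the Java CoreNLP pipeline the papers used, and the "{lemma}_{POS}" format itself is just my assumption:

    import stanza

    # stanza.download("en")  # one-time model download
    nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma")

    def lemma_pos_tokens(text):
        doc = nlp(text)
        # upos is the Universal POS tag; joining lemma and tag is an assumption
        return [f"{word.lemma}_{word.upos}"
                for sentence in doc.sentences
                for word in sentence.words]

    print(lemma_pos_tokens("The cats were sitting on the mats."))
    # e.g. ['the_DET', 'cat_NOUN', 'be_AUX', 'sit_VERB', 'on_ADP', 'the_DET', 'mat_NOUN', '._PUNCT']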

Evangelical answered 29/5, 2018 at 12:3 Comment(1)
did you find the answer to your question? – Gardel

For your first question, the answer is "it depends on what you are trying to accomplish!"

There isn't a recommended way per se to pre-process text. To clean a text corpus, the usual first steps are tokenization and lemmatization. Next, to drop unimportant terms/tokens, you can remove stop words or apply POS tags and filter tokens by their grammatical category, on the assumption that some categories (such as adjectives) do not carry valuable information for, say, topic modelling. But this depends entirely on the type of analysis you plan to run after the pre-processing step.
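For example, a POS-based filter could look like the following sketch. It uses spaCy purely as an illustration (not necessarily the toolkit you would use), and the set of kept categories is an arbitrary assumption:

    import spacy

    # Small English pipeline; install once with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    # Assumed set of "content" categories to keep; adjust to your analysis
    KEEP_POS = {"NOUN", "PROPN", "VERB", "ADJ"}

    def clean_tokens(text):
        doc = nlp(text)
        return [tok.lemma_.lower()
                for tok in doc
                if tok.pos_ in KEEP_POS and not tok.is_stop and tok.is_alpha]

    print(clean_tokens("The quick brown foxes were jumping over the lazy dogs."))
    # e.g. ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']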

For the second part of your question: as explained above, tokenisation and lowercasing are standard parts of the pre-processing routine, so I suspect that, regardless of the ML algorithm used later on, your results will be better if you carefully pre-process your data. I am not sure whether POS tags contribute to the meaning of a sentence, though.
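As a minimal sketch of that tokenise-and-lowercase route feeding into doc2vec (toy data and hyperparameters are assumptions, not recommendations), gensim's simple_preprocess does essentially this:

    from gensim.utils import simple_preprocess
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus for illustration only
    texts = [
        "Document embeddings capture the meaning of whole texts.",
        "Word embeddings represent individual tokens.",
    ]

    # simple_preprocess tokenises and lowercases (and drops very short/long tokens)
    corpus = [TaggedDocument(words=simple_preprocess(t), tags=[i])
              for i, t in enumerate(texts)]

    # Assumed hyperparameters; tune them for a real corpus
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    new_vec = model.infer_vector(simple_preprocess("a new text about embeddings"))
    print(new_vec.shape)  # (50,)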

I hope this provides some valuable feedback for your research. If not, you could post a code sample to discuss this further.

Inflammable answered 20/10, 2022 at 12:34 Comment(0)
