I am currently using unigrams in my word2vec model, as follows:
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words
    #
    # NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence,
                                                remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
However, with this approach I miss important bigrams and trigrams in my dataset.
E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"
Hence, I want to capture the important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model as single tokens.
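To make the goal concrete, the transformation I am after might look like the sketch below. It is only an illustration of the desired output, not a solution: `PHRASES` is a hypothetical hand-written phrase list and `merge_phrases` is a hypothetical helper, neither of which exists in my current code. It also handles bigrams only. In practice the phrases would have to be detected automatically from the data.

```python
# Hypothetical phrase list, written by hand purely for illustration.
# In a real pipeline these would be detected from the corpus.
PHRASES = {("team", "work"), ("new", "york")}


def merge_phrases(words, phrases=PHRASES):
    """Join adjacent words that form a known bigram into a single token."""
    merged = []
    i = 0
    while i < len(words):
        # Compare case-insensitively, but keep the original casing in the output.
        pair = tuple(w.lower() for w in words[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(words[i:i + 2]))
            i += 2  # skip both words of the merged bigram
        else:
            merged.append(words[i])
            i += 1
    return merged


# Same shape as the output of review_to_sentences: a list of lists of words.
sentences = [["I", "like", "team", "work"], ["New", "York", "is", "big"]]
merged_sentences = [merge_phrases(s) for s in sentences]
```

Here "team", "work" becomes the single token "team_work" and "New", "York" becomes "New_York", so the model would learn one vector per phrase instead of one per word.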
I am new to word2vec and am struggling to work out how to do this. Please help me.