Get bigrams and trigrams in word2vec Gensim

Asked 9/9, 2017 at 9:49 Answered 9/5, 2019 at 6:1

Solved python tokenize word2vec gensim n-gram

I am currently using uni-grams in my word2vec model as follows.

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words.
    #
    # Use the NLTK tokenizer to split the paragraph into sentences.
    raw_sentences = tokenizer.tokenize(review.strip())

    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence,
                                                remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences

However, this way I miss important bigrams and trigrams in my dataset.

E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"

Hence, I want to capture the important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model.

I am new to word2vec and struggling to work out how to do this. Please help me.

Putput answered 9/9, 2017 at 9:49 Comment(2)

Provide some code and a better example. The example you're showing doesn't reflect the data you provided in the first line. – Audly 9/9, 2017 at 9:52

Done! Updated the question. Please help me to solve this issue. – Putput 9/9, 2017 at 12:39

First of all, you should use gensim's Phrases class to get bigrams. It works as shown in the docs:

>>> from gensim.models.phrases import Phrases, Phraser
>>> # phrases is a Phrases model already trained on your tokenized sentences
>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']

To get trigrams and longer phrases, apply the bigram model you already have to your sentences and train Phrases again on the result, repeating as needed. Example:

trigram_model = Phrases(bigram_sentences)

There is also a good notebook and video that explain how to use it: the notebook, the video.

The most important part is how to use it on real-life sentences, which is as follows:

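# train the bigram model on the tokenized sentences
bigram_model = Phrases(unigram_sentences)

# apply the trained model to every sentence; keep token lists (not joined
# strings), because Phrases expects each sentence as a list of tokens
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])

# train a trigram model on top of the bigram output
trigram_model = Phrases(bigram_sentences)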

Hope this helps, but next time please give us more information about what you are using.

P.S.: Now that you have edited the question: your code does nothing to produce bigrams, it only splits the text into words. You have to use Phrases to get words like New York as bigrams.

Charismatic answered 9/9, 2017 at 12:56 Comment(4)

Thank you for your valuable answer. But when I use bigram = Phraser(phrases), it reports an undefined name for both Phraser and phrases. Do I need to import them? – Putput 9/9, 2017 at 14:50

@Volka Yes, you need to import them; they are in gensim's models module. I know the gensim docs are confusing sometimes. – Charismatic 9/9, 2017 at 15:24

@Charismatic Please let me know if you know an answer for this #46138072 – Rafaello 10/9, 2017 at 5:41

Generally it is good to remove stop words and stem after you created your n-gram dictionary. – Expansionism 25/12, 2018 at 21:57

from gensim.models import Phrases

from gensim.models.phrases import Phraser

documents = [
  "the mayor of new york was there", 
  "machine learning can be useful sometimes",
  "new york mayor was present"
  ]

sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)

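# Note: a bytes delimiter (b' ') matches older gensim 3.x releases;
# gensim 4.x expects a str delimiter such as ' ' instead.
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')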

bigram_phraser = Phraser(bigram)


print(bigram_phraser)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]

    print(tokens_)

Perique answered 12/1, 2018 at 19:48 Comment(5)

@Putput You need the imports below: from gensim.models import Phrases and from gensim.models.phrases import Phraser. – Perique 12/1, 2018 at 19:49

It would be nice to know the output of Phrases and Phraser and what bigram and bigram_phraser looks like. What about Word2Vec with sg=1, for skip gram=1 with negative sampling and window – Squilgee 3/7, 2019 at 7:22

@tgrandje when I run your code above, it works till print(sentence_stream) but when I get to bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ') I get following error: TypeError: sequence item 0: expected a bytes-like object, str found – Pierrepierrepont 7/8, 2022 at 17:28

@Pierrepierrepont: that's not my code (I just fixed a typo), but I seem to remember you can replace the b' ' delimiter with a simple ' '. I'm not sure whether this code dates back to Python 2.x, so I didn't take the liberty of changing that... – Jemy 8/8, 2022 at 20:53

I replaced the delimiter with ___ and it worked. Basically, the delimiter option sets the string used to join the words that form a phrase (e.g. a bigram or trigram). – Pierrepierrepont 10/8, 2022 at 17:22

Phrases and Phraser are what you should be looking for.

import gensim

# data_words is your corpus as a list of token lists
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
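The "higher threshold, fewer phrases" remark can be illustrated without gensim at all. The sketch below mirrors the formula of gensim's default scorer as I understand it, (bigram_count - min_count) / (count_a * count_b) * vocab_size; the function name and all the counts are hypothetical numbers chosen for illustration.

```python
def default_style_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate pair in the style of gensim's default scorer.

    A pair scores high when it occurs together often relative to how
    often each word occurs on its own.
    """
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Hypothetical counts: "new" and "york" each occur 3 times, always together,
# in a corpus whose vocabulary (unigrams plus candidate bigrams) has 32 entries.
score = default_style_score(count_a=3, count_b=3, count_ab=3,
                            vocab_size=32, min_count=1)

# A pair is merged only if its score exceeds the threshold, so raising
# the threshold makes fewer pairs qualify.
for threshold in (1, 10, 100):
    print(threshold, score > threshold)
```

With these numbers the score is about 7.1, so the pair is merged at threshold=1 but not at 10 or 100. On a small corpus, a threshold of 100 for the trigram pass may therefore merge nothing at all.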

Once you have finished adding vocabulary to the model, wrap it in a Phraser for faster access and lower memory usage. This is not mandatory, but it is useful.

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

Thanks,

Siccative answered 9/5, 2019 at 6:1 Comment(1)

How to use it for train and test data? I want to use train data to learn phrases and then transform it to test. How can I do that? – Litt 10/9, 2019 at 13:27
