Is there a bigram or trigram feature in spaCy?
A

4

16

The code below breaks the sentence into individual tokens; the output is as follows:

 "cloud"  "computing"  "is" "benefiting"  " major"  "manufacturing"  "companies"


import en_core_web_sm

# load the small English pipeline
nlp = en_core_web_sm.load()

doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one term.

Basically, I am looking for a bigram. Is there any feature in spaCy that allows bigrams or trigrams?

Anderson answered 3/12, 2018 at 16:50 Comment(1)
@chirag. I have seen that solution. I think you are referring to this: #39242209. But it is a hack; it does not solve the problem head-on. Not to mention the many additional lines of code in that noun-chunk approach. – Anderson
B
16

spaCy allows the detection of noun chunks. So, to parse your noun phrases as single entities, do this:

  1. Detect the noun chunks: https://spacy.io/usage/linguistic-features#noun-chunks

  2. Merge the noun chunks

  3. Run dependency parsing again; it will now parse "cloud computing" as a single entity.

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
...     noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
... 
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
Broadleaved answered 4/12, 2018 at 11:45 Comment(3)
Thanks for your answer, but the solution you provided is a workaround rather than a universal solution. Take this text as an example: doc = nlp("Big data cloud computing cyber security machine learning"). It is not a coherent sentence but rather a collection of words. In this case I don't get cloud computing; I get ['Big data cloud', 'cyber security machine learning']. – Anderson
Because that's the way it is: the model is trained on coherent sentences with good grammatical structure. What you are looking for is really more like NER, for which you would have to train a model for your use case. – Broadleaved
Update 2022: you should use this instead now: nlp.add_pipe("merge_noun_chunks") – Beverage
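
For reference, the updated approach from the comment above looks roughly like this in spaCy 3 (a minimal sketch, assuming the en_core_web_sm package is installed; exact tags can vary by model version):

import spacy

nlp = spacy.load("en_core_web_sm")
# built-in component that retokenizes each noun chunk into a single token
nlp.add_pipe("merge_noun_chunks")

doc = nlp("Cloud computing is benefiting major manufacturing companies")
print([(token.text, token.pos_) for token in doc])
# roughly: [('Cloud computing', 'NOUN'), ('is', 'AUX'),
#           ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]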
C
14

If you have a spaCy Doc, you can pass it to textacy:

import textacy.extract

# extract bigrams that occur at least twice in the doc
ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
Caddoan answered 10/3, 2020 at 9:56 Comment(4)
How would I do this for a list of documents? Just keep appending to the ngrams list? – Dogoodism
@AditSanghvi Using extend or via a list comprehension. – Caddoan
Link is broken. Find the GitHub and useful links here. – Plutonium
@Plutonium I updated the link and the module name. – Caddoan
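
Following the comment exchange above, for a list of documents you can collect everything into one list with extend (a sketch assuming textacy and en_core_web_sm are installed; texts is a hypothetical list of raw strings):

import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
texts = ["first example text ...", "second example text ..."]  # hypothetical input

ngrams = []
for doc in nlp.pipe(texts):  # parse each text with spaCy
    # append this document's bigrams to one flat list
    ngrams.extend(textacy.extract.basics.ngrams(doc, 2, min_freq=2))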
J
6

Warning: this is just an extension of the correct answer made by Zuzana.

My reputation does not allow me to comment, so I am writing this answer just to answer Adit Sanghvi's question above: "How do you do it when you have a list of documents?"

  1. First you need to create a list with the text of the documents

  2. Then you join the text lists into just one document

  3. Now you use the spaCy parser to transform the text document into a spaCy document

  4. You use Zuzana's answer to create the bigrams

This is the example code:

Step 1

doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1, doc2, doc3]
# flatten the list of single-element lists into one flat list of strings
textList = [text for doc in listOfDocuments for text in doc]
print(textList)

This will print this text:

['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']

then step 2 and 3:

doc = ' '.join(textList)

import spacy
parser = spacy.load('en_core_web_sm')  # the "parser" is just a loaded spaCy pipeline
spacy_doc = parser(doc)
print(spacy_doc)

and will print this:

all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams

Finally step 4 (Zuzana's answer)

import textacy.extract

ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)

will print this:

[make bigrams, make bigrams, make bigrams]

Johnniejohnny answered 20/1, 2021 at 17:47 Comment(0)
P
1

I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams (word_3gram, word_2gram, etc.), with the gram as the basic unit (cloud_computing).

Assume I have the sentence "I like cloud computing because it's cheap". The sentence_2gram is: "I_like", "like_cloud", "cloud_computing", "computing_because", ... Comparing that against your bigram list, only "cloud_computing" is recognized as a valid bigram; all the other bigrams in the sentence are artificial. To recover all the other words, you just take the first part of each remaining bigram:

"I_like".split("_")[0] -> I; 
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in bigram list, keep it. 
  skip next bi-gram "computing_because" ("computing" is already used)
"because_it's".split("_")[0]" -> "because" etc.

To also capture the last word in the sentence ("cheap") I added the token "EOL". I implemented this in Python, and the speed was OK (500k words in 3 minutes) on an i5 processor with 8 GB of RAM. Anyway, you only have to do it once. I find this more intuitive than the official (spaCy-style) chunk approach, and it also works for non-spaCy frameworks.
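
A minimal sketch of this greedy matching, assuming a known set of valid bigrams (the helper name merge_bigrams and the underscore-joined format are illustrative, not the original implementation):

def merge_bigrams(words, valid_bigrams):
    """Greedily merge adjacent word pairs that appear in valid_bigrams."""
    words = words + ["EOL"]  # sentinel so the last real word is always visited
    out, i = [], 0
    while i < len(words) - 1:
        bigram = words[i] + "_" + words[i + 1]
        if bigram in valid_bigrams:
            out.append(bigram)    # keep the merged unit
            i += 2                # skip the overlapping next bigram
        else:
            out.append(words[i])  # keep only the first word of the pair
            i += 1
    return out

# only "cloud_computing" is in the bigram list
print(merge_bigrams("I like cloud computing because it's cheap".split(),
                    {"cloud_computing"}))
# ['I', 'like', 'cloud_computing', 'because', "it's", 'cheap']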

I do this before the official tokenization/lemmatization, since otherwise lemmatization would give "cloud compute" as a possible bigram. But I'm not certain whether this is the best/right approach.

Pyrolysis answered 18/1, 2019 at 14:12 Comment(0)
