How to generate bi/tri-grams using spacy/nltk

The input text is always a list of dish names, each made up of 1~3 adjectives and a noun.

Inputs:

thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry

Outputs:

thai tea, iced tea
spicy chicken, fried chicken
sweet pork, chili pork
thai chicken, chicken curry, thai curry

Basically, I want to parse the sentence tree and generate bigrams by pairing each adjective with the noun.

I would like to achieve this with spaCy or NLTK.

Mizuki answered 31/8, 2016 at 5:53 Comment(1)

I used spaCy 2.0 with the English model. The idea is to tag every token in the input, separate the nouns from the "not-nouns", and then pair each not-noun with the noun to create the desired output.

Your input:

s = ["thai iced tea",
"spicy fried chicken",
"sweet chili pork",
"thai chicken curry",]

Spacy solution:

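import spacy

# 'en' is the spaCy 2.x shortcut name; newer releases need a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

def noun_notnoun(phrase):
    doc = nlp(phrase)  # tokenize and PoS-tag the phrase
    token_not_noun = []
    notnoun_noun_list = []
    noun = None

    for item in doc:
        if item.pos_ == "NOUN":
            noun = item.text  # if there are several nouns, the last one wins
        else:
            token_not_noun.append(item.text)

    if noun is None:  # no noun found, so there is nothing to pair with
        return notnoun_noun_list

    # pair every not-noun with the noun
    for notnoun in token_not_noun:
        notnoun_noun_list.append(notnoun + " " + noun)

    return notnoun_noun_list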

Call function:

for phrase in s:
    print(noun_notnoun(phrase))

Results:

['thai tea', 'iced tea']
['spicy chicken', 'fried chicken']
['sweet pork', 'chili pork']
['thai chicken', 'curry chicken']
Misspell answered 16/2, 2018 at 13:45 Comment(2)
It would be good if you could describe your algorithm in words (not just code). What is it supposed to do? Will it work for longer sequences? I noticed that your approach doesn't preserve word order; for example, the output contains "curry chicken" although "curry" never appears before "chicken" in the input. - Chaps
Added some comments. Yep, it does not; I did not consider that a requirement. - Misspell
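
The word-order issue raised in the comments can be avoided by only pairing a token with a noun that appears later in the phrase. The following is a minimal sketch, not the answerer's method: `ordered_noun_pairs` is a hypothetical helper, and the part-of-speech tags are written by hand here. In practice they would come from `item.pos_`, and a real tagger may disagree with them.

```python
from itertools import combinations

def ordered_noun_pairs(tagged):
    """Pair each token with every noun that follows it, keeping input order.

    `tagged` is a list of (text, pos) tuples, e.g. built from a spaCy doc as
    [(t.text, t.pos_) for t in doc].
    """
    pairs = []
    # combinations() yields index-ordered pairs, so the left word always
    # precedes the right word in the original phrase.
    for (left, _), (right, right_pos) in combinations(tagged, 2):
        if right_pos == "NOUN":
            pairs.append(left + " " + right)
    return pairs

# Hand-assigned tags (an assumption; a tagger may label these differently).
example = [("thai", "ADJ"), ("chicken", "NOUN"), ("curry", "NOUN")]
print(ordered_noun_pairs(example))
# ['thai chicken', 'thai curry', 'chicken curry']
```

With these tags the three pairs from the question are produced, and an adjective is never placed in the second slot.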

You can achieve this in a few steps with NLTK:

  1. PoS tag the sequences

  2. generate the desired n-grams (your examples contain no trigrams, only bigrams and skip-grams; the latter can be generated by building trigrams and punching out the middle token)

  3. discard all n-grams that don't match the pattern JJ NN.
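
Step 2 can be illustrated without NLTK. This is a small sketch (the function name `bigrams_and_skipgrams` is made up for illustration) showing that a trigram with its middle token removed is the same as pairing each token with the one two positions later.

```python
def bigrams_and_skipgrams(tokens):
    """Return adjacent pairs followed by pairs that skip one token."""
    adjacent = list(zip(tokens, tokens[1:]))  # ordinary bigrams
    skipped = list(zip(tokens, tokens[2:]))   # trigrams minus the middle token
    return adjacent + skipped

print(bigrams_and_skipgrams(["thai", "iced", "tea"]))
# [('thai', 'iced'), ('iced', 'tea'), ('thai', 'tea')]
```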

Example:

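import nltk  # word_tokenize and pos_tag need the NLTK tokenizer and tagger data (see nltk.download)

def jjnn_pairs(phrase):
    '''
    Iterate over pairs of JJ-NN.
    '''
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    for ngram in ngramise(tagged):
        # split the (token, tag) pairs into a tuple of tokens and a tuple of tags
        tokens, tags = zip(*ngram)
        if tags == ('JJ', 'NN'):
            yield tokens

def ngramise(sequence):
    '''
    Iterate over bigrams and 1-skip-bigrams
    (trigrams with the middle token removed).
    '''
    for bigram in nltk.ngrams(sequence, 2):
        yield bigram
    for trigram in nltk.ngrams(sequence, 3):
        yield trigram[0], trigram[2]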

Extend the pattern ('JJ', 'NN') and the desired n-grams to your needs.

I think there is no need for parsing. The major problem of this approach, however, is that most PoS taggers will probably not tag everything exactly the way you want. For example, the default PoS tagger of my NLTK installation tagged "chili" as NN, not JJ, and "fried" got VBD. Parsing won't help you with that, though!
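
One way to live with such tagging differences is to widen the accepted pattern rather than fix the tagger. A sketch under assumptions: the tag sets below are an illustrative choice, `modifier_noun_pairs` is a hypothetical helper, and the tagged input is written by hand using the tag reported above for "fried" (VBD); the other tags are assumed.

```python
# Tags accepted in the modifier slot and in the head slot (illustrative choice).
MODIFIER_TAGS = {"JJ", "NN", "VBD", "VBN"}
HEAD_TAGS = {"NN"}

def modifier_noun_pairs(tagged):
    """Yield (modifier, head) pairs from adjacent and one-skip token pairs."""
    candidates = list(zip(tagged, tagged[1:])) + list(zip(tagged, tagged[2:]))
    for (word1, tag1), (word2, tag2) in candidates:
        if tag1 in MODIFIER_TAGS and tag2 in HEAD_TAGS:
            yield word1, word2

# Hand-written tags; "fried" as VBD follows the tagger behaviour described above.
tagged = [("spicy", "JJ"), ("fried", "VBD"), ("chicken", "NN")]
print(list(modifier_noun_pairs(tagged)))
# [('fried', 'chicken'), ('spicy', 'chicken')]
```

Widening the pattern has a cost: with "chili" tagged NN, the pair "sweet chili" would be accepted as well, so the tag sets need tuning against real data.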

Chaps answered 31/8, 2016 at 6:33 Comment(0)

Something like this:

>>> from nltk import bigrams
>>> text = """thai iced tea
... spicy fried chicken
... sweet chili pork
... thai chicken curry"""
>>> lines = map(str.split, text.split('\n'))
>>> for line in lines:
...     ", ".join([" ".join(bi) for bi in bigrams(line)])
... 
'thai iced, iced tea'
'spicy fried, fried chicken'
'sweet chili, chili pork'
'thai chicken, chicken curry'

Alternatively using colibricore https://proycon.github.io/colibri-core/doc/#installation ;P

Unexacting answered 31/8, 2016 at 8:52 Comment(1)
Hey Alvas, I am specifically trying to avoid adjective-adjective pairs, e.g. "spicy fried". - Mizuki
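
If the inputs really are always modifiers followed by a single head noun, as the question states, adjective-adjective pairs can be avoided without a tagger by pairing every earlier word with the last word. This is a sketch under that assumption only (`pair_with_last_word` is a hypothetical name); it would not produce "thai chicken" for "thai chicken curry".

```python
def pair_with_last_word(phrase):
    """Pair each word except the last with the last word of the phrase."""
    words = phrase.split()
    if len(words) < 2:
        return []  # nothing to pair
    head = words[-1]  # assumed to be the noun
    return [word + " " + head for word in words[:-1]]

print(pair_with_last_word("spicy fried chicken"))
# ['spicy chicken', 'fried chicken']
```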
