How to compute skipgrams in python?
Asked Answered
S

5

20

A k skipgram is an ngram which is a superset of all ngrams and each (k-i )skipgram till (k-i)==0 (which includes 0 skip grams). So how to efficiently compute these skipgrams in python?

Following is the code i tried but it is not doing as expected:

<pre>
    input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    def find_skipgrams(input_list, N,K):
  bigram_list = []
  nlist=[]

  K=1
  for k in range(K+1):
      for i in range(len(input_list)-1):
          if i+k+1<len(input_list):
              nlist=[]
              for j in range(N+1):
                  if i+k+j+1<len(input_list):
                    nlist.append(input_list[i+k+j+1])

          bigram_list.append(nlist)
  return bigram_list

</pre>

The above code is not rendering correctly, but print find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'],2,1) gives following output

[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]

The code listed here also does not give correct output: https://github.com/heaven00/skipgram/blob/master/skipgram.py

print skipgram_ndarray("What is your name") gives: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

name is a unigram!

Skillless answered 6/8, 2015 at 5:44 Comment(2)
What, if anything, have you attempted?Breskin
@Breskin updated the question !!Skillless
D
19

From the paper that OP links, the following string:

Insurgents killed in ongoing fighting

Yields:

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.

With slight modification to NLTK's ngrams code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):

from itertools import chain, combinations
import copy
from nltk.util import ngrams

def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    return sequence

def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
    sequence_length = len(sequence)
    sequence = iter(sequence)
    sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

    if sequence_length + pad_left + pad_right < k:
        raise Exception("The length of sentence + padding(s) < skip")

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    

    history = []
    nk = n+k

    # Return point for recursion.
    if nk < 1: 
        return
    # If n+k longer than sequence, reduce k by 1 and recur
    elif nk > sequence_length: 
        for ng in skipgrams(list(sequence), n, k-1):
            yield ng

    while nk > 1: # Collects the first instance of n+k length history
        history.append(next(sequence))
        nk -= 1

    # Iterative drop first item in history and picks up the next
    # while yielding skipgrams for each iteration.
    for item in sequence:
        history.append(item)
        current_token = history.pop(0)      
        # Iterates through the rest of the history and 
        # pick out all combinations the n-1grams
        for idx in list(combinations(range(len(history)), n-1)):
            ng = [current_token]
            for _id in idx:
                ng.append(history[_id])
            yield tuple(ng)

    # Recursively yield the skigrams for the rest of seqeunce where
    # len(sequence) < n+k
    for ng in list(skipgrams(history, n, k-1)):
        yield ng

Let's do some doctest to match the example in the paper:

>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

But do note that if n+k > len(sequence), it will yield the same effects as skipgrams(sequence, n, k-1) (this is not a bug, it's a fail safe feature), e.g.

>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
>>> four_skip_fourgrams  = list(skipgrams(text, n=4, k=4))
>>> four_skip_fivegrams  = list(skipgrams(text, n=5, k=4))
>>>
>>> print len(three_skip_trigrams), three_skip_trigrams
10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
>>> print len(three_skip_fourgrams), three_skip_fourgrams 
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print len(four_skip_fourgrams), four_skip_fourgrams 
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print len(four_skip_fivegrams), four_skip_fivegrams 
1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]

This allows n == k but it disallow n > k as shown in the lines :

if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    

For understanding sake, let's try to understand the "mystical" line:

for idx in list(combinations(range(len(history)), n-1)):
    pass # Do something

Given a list of unique items, combinations produce this:

>>> from itertools import combinations
>>> x = [0,1,2,3,4,5]
>>> list(combinations(x,2))
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

And since the indices of a list of tokens is always unique, e.g.

>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0) # i.e. 'this'
>>> range(len(sent))
[0,1,2,3]

It's possible to compute the possible combinations (without replacement) of the range:

>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

If we map the indices back to the list of tokens:

>>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]

Then we concatenate with the current_token, we get the skipgrams for the current token and context+skip window:

>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]

So after that we move on to the next word.

Diogenes answered 7/8, 2015 at 20:40 Comment(6)
nice work , but i would want that it should return the sentence itself if lengths exceedSkillless
can you answer this : #31828256Skillless
@Skillless it's a totally different NLP task but i'll try when i'm free =)Diogenes
About the elif nk > sequence_length: for ng in skipgrams(list(sequence), n, k-1): yield ng; it's essentially the same as how normally an ngram would be generated. I would keep it as it is instead of returning a single list of strings.Diogenes
thanks o check out that question , its suprising such a common problem is not solved yet..Skillless
also check i have issued a +100 bounty for a nlp question : #31986966Skillless
D
11

EDITED

The latest NLTK version 3.2.5 has the skipgrams implemented.

Here's a cleaner implementation from @jnothman from the NLTK repo: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538

def skipgrams(sequence, n, k, **kwargs):
    """
    Returns all possible skipgrams generated from a sequence of items, as an iterator.
    Skipgrams are ngrams that allows tokens to be skipped.
    Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

    :param sequence: the source data to be converted into trigrams
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param k: the skip distance
    :type  k: int
    :rtype: iter(tuple)
    """

    # Pads the sequence as desired by **kwargs.
    if 'pad_left' in kwargs or 'pad_right' in kwargs:
    sequence = pad_sequence(sequence, n, **kwargs)

    # Note when iterating through the ngrams, the pad_right here is not
    # the **kwargs padding, it's for the algorithm to detect the SENTINEL
    # object on the right pad to stop inner loop.
    SENTINEL = object()
    for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
    head = ngram[:1]
    tail = ngram[1:]
    for skip_tail in combinations(tail, n - 1):
        if skip_tail[-1] is SENTINEL:
            continue
        yield head + skip_tail

[out]:

>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Diogenes answered 9/8, 2015 at 8:52 Comment(1)
COOL, where is the link?Skillless
D
6

Although this would part entirely from your code and defer it to an external library; you can use Colibri Core (https://proycon.github.io/colibri-core) for skipgram extraction. It's a library written specifically for efficient n-gram and skipgram extraction from big text corpora. The code base is in C++ (for speed/efficiency), but a Python binding is available.

You rightfully mentioned efficiency, as skipgram extraction quickly shows exponential complexity, which may not be a big issue if you only pass a sentence as you did in your input_list, but becomes problematic if you release it on large corpus data. To mitigate this you can set parameters such as an occurrence threshold, or demand each skip of a skipgram to be fillable by at least x distinct n-grams.

import colibricore

#Prepare corpus data (will be encoded for efficiency)
corpusfile_plaintext = "somecorpus.txt" #input, one sentence per line
encoder = colibricore.ClassEncoder()
encoder.build(corpusfile_plaintext)
corpusfile = "somecorpus.colibri.dat" #corpus output
classfile = "somecorpus.colibri.cls" #class encoding output
encoder.encodefile(corpusfile_plaintext,corpusfile)
encoder.save(classfile)

#Set options for skipgram extraction (mintokens is the occurrence threshold, maxlength maximum ngram/skipgram length)
colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True)

#Instantiate an empty pattern model 
model = colibricore.UnindexedPatternModel()

#Train the model on the encoded corpus file (this does the skipgram extraction)
model.train(corpusfile, options)

#Load a decoder so we can view the output
decoder = colibricore.ClassDecoder(classfile)

#Output all skipgrams
for pattern in model:
     if pattern.category() == colibricore.Category.SKIPGRAM:
         print(pattern.tostring(decoder))

There's a more extensive Python tutorial about all this on the website.

Disclaimer: I'm the author of Colibri Core

Descendent answered 22/9, 2015 at 8:55 Comment(5)
ya I had tried that before writing this question but was unable to install colibri on ubuntuSkillless
I have improved the installation procedure and instructions last week, I hope it installs with less trouble now.Descendent
@proycon, is it possible to create duck-types in the colibri's python wrappers such that the interface looks like in NLTK, e.g. colibri.ngrams(text, n=3) or colibri.skipgram(text, n=3, k=2)? Or is it easier to re-implement some bits of colibri wrappers within the NLTK repo?Diogenes
@Diogenes I'm afraid the extra overhead would come at a huge performance cost and may lead to inefficient code. Encoding and decoding Python strings to colibri's internal compressed representation should be done as early respectively late as possible. It would be beneficial only if text is a real big blob of text (in which case it's also better to just let colibri read it from file directly as that will be faster). As to implementing a wrapper within NLTK, I'm not sure whether they want dependencies to external C++ libraries?Descendent
@proycon, thanks for the note! Possibly an NLTK wrapper that calls colibri outside of python and then read the output file within python would be faster (like what they did with Stanford/MaltParser). The overhead might be to read/write textfiles but that shouldn't be much of a problem. Thanks again!Diogenes
W
2

Refer this for complete info.

The below example has already been mentioned in it about it's usage and works like a charm!

>>>sent = "Insurgents killed in ongoing fighting".split()
>>>list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
Weinman answered 7/7, 2016 at 5:59 Comment(1)
The skipgram function was created due to this old question in nltk after it was requested in their forumSkillless
A
0

How about using someone else's implementation https://github.com/heaven00/skipgram/blob/master/skipgram.py , where k = skip_size and n=ngram_order:

def skipgram_ndarray(sent, k=1, n=2):
    """
    This is not exactly a vectorized version, because we are still
    using a for loop
    """
    tokens = sent.split()
    if len(tokens) < k + 2:
        raise Exception("REQ: length of sentence > skip + 2")
    matrix = np.zeros((len(tokens), k + 2), dtype=object)
    matrix[:, 0] = tokens
    matrix[:, 1] = tokens[1:] + ['']
    result = []
    for skip in range(1, k + 1):
        matrix[:, skip + 1] = tokens[skip + 1:] + [''] * (skip + 1)
    for index in range(1, k + 2):
        temp = matrix[:, 0] + ',' + matrix[:, index]
        map(result.append, temp.tolist())
    limit = (((k + 1) * (k + 2)) / 6) * ((3 * n) - (2 * k) - 6)
    return result[:limit]

def skipgram_list(sent, k=1, n=2):
    """
    Form skipgram features using list comprehensions
    """
    tokens = sent.split()
    tokens_n = ['''tokens[index + j + {0}]'''.format(index)
                for index in range(n - 1)]
    x = '(tokens[index], ' + ', '.join(tokens_n) + ')'
    query_part1 = 'result = [' + x + ' for index in range(len(tokens))'
    query_part2 = ' for j in range(1, k+2) if index + j + n < len(tokens)]'
    exec(query_part1 + query_part2)
    return result
Addams answered 7/8, 2015 at 19:50 Comment(2)
no it does not work,print skipgram_ndarray("What is your name") gives: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name'] the name is unigram and the other function is even more wrongSkillless
This implementation is hard coded for k<3. It does work, it's just not scaled (and also with lots of hacks... exec(...) is funny).Diogenes

© 2022 - 2024 — McMap. All rights reserved.