How to extract character ngram from sentences? - python
Asked Answered
S

1

6

The following word2ngrams function extracts character 3grams from a word:

>>> x = 'foobar'
>>> n = 3
>>> [x[i:i+n] for i in range(len(x)-n+1)]
['foo', 'oob', 'oba', 'bar']

This post shows the character ngrams extraction for a single word, Quick implementation of character n-grams using python.

But what if i have sentences and i want to extract the character ngrams, is there a faster method other than iteratively call the word2ngram()?

What will be the regex version of achieving the same word2ngram and sent2ngram output? would it be faster?

I've tried:

import string, random, time
from itertools import chain

def word2ngrams(text, n=3):
  """ Convert word into character ngrams. """
  return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if not " " in text[i:i+n]]

# Generate 10000 random strings of length 100.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print time.time() - start        

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print time.time() - start        

print x==y

[out]:

0.0205280780792
0.0271739959717
True

EDITED

The regex method looks elegant but it performs slower than iteratively calling word2ngram():

import string, random, time, re
from itertools import chain

def word2ngrams(text, n=3):
  """ Convert word into character ngrams. """
  return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if not " " in text[i:i+n]]

def sent2ngrams_regex(text, n=3):
    rgx = '(?=('+'\S'*n+'))'
    return re.findall(rgx,text)

# Generate 10000 random strings of length 100.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print time.time() - start        

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print time.time() - start        

start = time.time()
z = [sent2ngrams_regex(i) for i in sents]
print time.time() - start  

print x==y==z

[out]:

0.0211708545685
0.0284190177917
0.0303599834442
True
Status answered 15/3, 2014 at 18:32 Comment(0)
R
6

Why not just (?=(...))

edit Same thing, but not whitespace (?=(\S\S\S))
edit2 You can use just what you want as well. Ex. uses alphanum only (?=([^\W_]{3}))

Uses a lookahead to capture 3 characters. Then the engine bumps the position up 1 time each
match. Then captures next 3.

Result of foobar is
foo
oob
oba
bar

 # Compressed regex
 #  (?=(...))

 # Expanded regex
 (?=                   # Start Lookahead assertion
      (                     # Capture group 1 start
           .                     # dot - metachar, matches any character except newline
           .                     # dot - metachar
           .                     # dot - metachar
      )                     # Capture group 1 end
 )                     # End Lookahead assertion
Rector answered 15/3, 2014 at 19:19 Comment(11)
pardon my noobiness, what is (?=(...))? can you give a working example? i've tried: (?=('foobar')) but got a syntax error.Status
Added some comments, (?=(...)) is the regex. I don't know Python, but it should be used as a regular expression in match all (get an array output) context.Rector
re.findall(r'(?=(...))','foobar'), out: ['foo', 'oob', 'oba', 'bar'].Status
Cool ... Is that what you are looking for?Rector
[i for i in re.findall(r'(?=(...))','foobar like') if not " " in i]Status
yeah, let me try to profile, it and see how much time i save =). Thanks for the neat regex trick. Is there a way to not check for the " "?Status
So, not whitespace then. Hang on a sec.Rector
Yeah, so same thing, no whitespace (?=(\S\S\S))Rector
@sin, it's a nice regex but it performs much slower than iterating through individual words =(Status
@Status - A slowdown could be rgx = should be compiled only once, not for each sentence. It should be pre-compiled before the iteration. And you could also improve the speed of the regex by %10-15 if you actively move the match position. i.e, /(?=(\S\S\S))./ Add the Dot-All modifier (same as /(?=(\S\S\S))[\S\s]/ or /(?s)(?=(\S\S\S))./).Rector
It solved my problem; without even getting near to your approach! Thanks!Lutz

© 2022 - 2024 — McMap. All rights reserved.