How to extract numbers (along with comparison adjectives or ranges)

Asked 16/7, 2017 at 7:19 Answered 28/10, 2022 at 21:25

I am working on two NLP projects in Python, and both involve a similar task: extracting numerical values and comparison operators from sentences like the following:

"... greater than $10 ... ",
"... weight not more than 200lbs ...",
"... height in 5-7 feets ...",
"... faster than 30 seconds ... "

I found two different approaches to solve this problem:

using very complex regular expressions.
using Named Entity Recognition (and some regexes, too).

How can I parse numerical values out of such sentences? I assume this is a common task in NLP.

The desired output would be something like:

Input:

"greater than $10"

Output:

{'value': 10, 'unit': 'dollar', 'relation': 'gt', 'position': 3}

Pair answered 16/7, 2017 at 7:19 Comment(2)

Use the CogComp-quantifier package: github.com/CogComp/cogcomp-nlp/tree/master/pipeline It can extract quantities and normalize their units. – Emporium 17/7, 2017 at 13:45

Facebook's Duckling is good for this task: github.com/facebookincubator/duckling – Boley 25/9, 2017 at 4:11

I would probably approach this as a chunking task and use NLTK's part-of-speech tagger combined with its regular expression chunker. This lets you define a regular expression over the parts of speech of the words in your sentences instead of over the words themselves. For a given sentence, you can do the following:

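import nltk

# One-time setup: pos_tag needs the tagger model to be downloaded first.
# nltk.download('averaged_perceptron_tagger')  # 'averaged_perceptron_tagger_eng' in newer NLTK releases

# example sentence
sent = 'send me a table with a price greater than $100'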

The first thing I would do is modify your sentences slightly so that the part-of-speech tagger sees each number and unit as a separate token. Here are some changes you can make with very simple regular expressions; you can experiment to see if others help:

$10 -> 10 dollars
200lbs -> 200 lbs
5-7 -> 5 - 7 OR 5 to 7

So we get:

sent = 'send me a table with a price greater than 100 dollars'
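The three rewrites listed above could be sketched with the standard re module roughly as follows. The function name normalize_quantities is made up for illustration, and the patterns are assumptions that only cover the examples from the question (they would also split things like "1st" into "1 st", so adjust them for your data):

import re

def normalize_quantities(text):
    """Split compact numeric notations into separate tokens for the POS tagger."""
    # $10 -> 10 dollars
    text = re.sub(r'\$(\d+(?:\.\d+)?)', r'\1 dollars', text)
    # 5-7 -> 5 to 7 (a hyphen with a digit on both sides)
    text = re.sub(r'(\d)\s*-\s*(\d)', r'\1 to \2', text)
    # 200lbs -> 200 lbs (a digit immediately followed by a letter)
    text = re.sub(r'(\d)([A-Za-z])', r'\1 \2', text)
    return text

sent = normalize_quantities('send me a table with a price greater than $100')

The dollar rule runs first so that the amount is already followed by a space before the digit-letter rule is applied.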

Now you can get the parts of speech from your sentence:

sent_pos = nltk.pos_tag(sent.split())
print(sent_pos)

[('send', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('table', 'NN'), ('with', 'IN'), ('a', 'DT'), ('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'), ('100', 'CD'), ('dollars', 'NNS')]

We can now create a chunker that chunks your POS-tagged text according to a (relatively) simple regular expression:

grammar = 'NumericalPhrase: {<NN|NNS>?<RB>?<JJR><IN><CD><NN|NNS>?}'
parser = nltk.RegexpParser(grammar)

This defines a parser with a grammar that chunks numerical phrases (what we'll call your phrase type). It defines your numerical phrase as: an optional noun, followed by an optional adverb, followed by a comparative adjective, a preposition, a number, and an optional noun. This is just a suggestion for how you may want to define your phrases, but I think that this will be much simpler than using a regular expression on the words themselves.

To get your phrases you can do:

print(parser.parse(sent_pos))
(S
  send/VB
  me/PRP
  a/DT
  table/NN
  with/IN
  a/DT
  (NumericalPhrase price/NN greater/JJR than/IN 100/CD dollars/NNS))

Or to get only your phrases you can do:

print([tree.leaves() for tree in parser.parse(sent_pos).subtrees() if tree.label() == 'NumericalPhrase'])

[[('price', 'NN'),
  ('greater', 'JJR'),
  ('than', 'IN'),
  ('100', 'CD'),
  ('dollars', 'NNS')]]
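To get from a chunk like this to the dictionary asked for in the question, one option is a small mapping step over the (word, tag) pairs. This is only a sketch: the function chunk_to_record and the lookup tables RELATIONS, NEGATED and UNITS are hypothetical names with a few sample entries, it assumes the number is written in digits, and it leaves out the 'position' field.

# Hypothetical lookup tables; extend them for your own vocabulary.
RELATIONS = {'greater': 'gt', 'more': 'gt', 'higher': 'gt',
             'less': 'lt', 'fewer': 'lt', 'lower': 'lt'}
NEGATED = {'gt': 'le', 'lt': 'ge'}  # "not more than" means "at most"
UNITS = {'dollars': 'dollar', 'lbs': 'lb', 'feet': 'foot', 'seconds': 'second'}

def chunk_to_record(leaves):
    """Turn the (word, tag) leaves of one NumericalPhrase into a dict."""
    record = {'value': None, 'unit': None, 'relation': None}
    negated = False
    seen_number = False
    for word, tag in leaves:
        lower = word.lower()
        if tag == 'RB' and lower in ('not', "n't"):
            negated = True
        elif tag == 'JJR':
            record['relation'] = RELATIONS.get(lower)
        elif tag == 'CD':
            seen_number = True
            try:
                record['value'] = float(word) if '.' in word else int(word)
            except ValueError:
                pass  # spelled-out numbers such as "ten" are not resolved here
        elif tag in ('NN', 'NNS') and seen_number:
            # only a noun after the number is treated as the unit
            record['unit'] = UNITS.get(lower, lower)
    if negated and record['relation'] in NEGATED:
        record['relation'] = NEGATED[record['relation']]
    return record

chunk = [('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'),
         ('100', 'CD'), ('dollars', 'NNS')]
print(chunk_to_record(chunk))

Nouns before the number (here "price") are ignored, so only the trailing noun ends up as the unit.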

Knighterrantry answered 16/7, 2017 at 15:20 Comment(0)

https://spacy.io/universe/project/numerizer might work for your use case.

From the link:

from spacy import load
import numerizer
nlp = load('en_core_web_sm') # or any other model
doc = nlp('The Hogwarts Express is at platform nine and three quarters')
doc._.numerize()
# {nine and three quarters: '9.75'}

Labiodental answered 28/10, 2022 at 21:25 Comment(0)
