How can Stanford CoreNLP Named Entity Recognition capture measurements like 5 inches, 5", 5 in., 5 in
I'm looking to capture measurements using Stanford CoreNLP. (If you can suggest a different extractor, that is fine too.)

For example, I want to find 15kg, 15 kg, 15.0 kg, 15 kilogram, 15 lbs, 15 pounds, etc. But among CoreNLP's extraction rules, I don't see one for measurements.

Of course, I can do this with pure regexes, but a toolkit can run faster, and it offers the opportunity to chunk at a higher level, e.g. to treat gb and gigabytes together, and RAM and memory as building blocks, even without full syntactic parsing, as it builds bigger units like 128 gb RAM and 8 gigabytes memory.
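For reference, the pure-regex baseline mentioned above might look like the following sketch. It covers only the weight forms listed earlier (the function name and unit list are illustrative, not from any toolkit):

```python
import re

# Illustrative unit alternation: kg/kgs, kilogram(s), lb/lbs, pound(s).
UNIT = r"(?:kgs?|kilograms?|lbs?|pounds?)"

# Number (with optional decimal part), optional whitespace, then a unit.
MEASURE = re.compile(rf"\b(\d+(?:\.\d+)?)\s*({UNIT})\b", re.IGNORECASE)

def find_measurements(text):
    """Return (value, unit) pairs found in text."""
    return [(float(value), unit.lower()) for value, unit in MEASURE.findall(text)]
```

For example, `find_measurements("Jack weighs 15.0 kg and lifts 15 lbs")` yields `[(15.0, 'kg'), (15.0, 'lbs')]` — but, as noted, each new unit family means growing this pattern by hand, with no higher-level chunking.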

I want a rule-based (not machine-learning-based) extractor for this, but I don't see one as part of RegexNER or elsewhere. How do I go about this?

IBM Named Entity Extraction can do this. It runs the regexes efficiently rather than passing the text through each one in turn, and the regexes are bundled to express meaningful entities, for example one that unites all the measurement units into a single concept.

Pregnable answered 13/12, 2015 at 14:30 Comment(0)
I don't think a rule-based system exists for this particular task. However, it shouldn't be hard to make one with TokensRegexNER. For example, a mapping like:

[{ner:NUMBER}]+ /(k|m|g|t)b/ memory?   MEMORY
[{ner:NUMBER}]+ /"|''|in(ches)?/       LENGTH
...

You could try using vanilla TokensRegex as well, and then just extract out the relevant value with a capture group:

(?$group_name [{ner:NUMBER}]+) /(k|m|g|t)b/ memory?
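If you go the TokensRegexNER route, rules like the ones above live in a tab-separated mapping file that is wired into the pipeline via the `regexner` annotator. A minimal properties sketch (the file name `rules.txt` is an assumption) might be:

```properties
# Sketch: regexner runs after the statistical NER annotator,
# so {ner:NUMBER} is already available to the rules in rules.txt.
annotators = tokenize, ssplit, pos, lemma, ner, regexner
regexner.mapping = rules.txt
```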
Announcement answered 23/12, 2015 at 22:9 Comment(3)
It looks like this is a special feature of IBM Named Entity Extraction. Regexes are of course possible in any system, but IBM NEE can run patterns far more efficiently, and can also treat related concepts together. (E.g., postal codes take dozens of forms worldwide but are all "postal codes" for the purpose of higher-level concepts.)Pregnable
This is certainly true: IBM has a far faster engine for this sort of regex matching. However, the examples above are from CoreNLP. TokensRegex (included in CoreNLP) is generally fast enough for most applications, particularly if either (1) the patterns are simple (no variable-length matches), or (2) there are few enough of them.Announcement
If you're looking for a mainly regex-based solution, you could also look at GATE's JAPE regular expression environment. I'm not sure it's any faster than ours, but it does have more GUI support.Mcalister
You can build your own training data and label the required measurements accordingly.

For example, if you have a sentence like "Jack weighs about 50 kgs", the model will classify your input as:

Jack, PERSON
weighs, O
about, O
50, MES
kgs, MES

where MES stands for measurement.

I have recently made training data for the Stanford NER tagger for my customized problem and have built a model for it.

I think you can do the same thing with the Stanford CoreNLP NER annotator.

Note that this is a machine-learning-based approach rather than a rule-based one.
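For reference, training such a custom model with the Stanford NER CRFClassifier is typically driven by a properties file along these lines (file names here are assumptions; column 0 is the token and column 1 the label, matching the tab-separated example above):

```properties
# Sketch of a minimal CRFClassifier training configuration.
trainFile = train.tsv
serializeTo = measurement-model.ser.gz
map = word=0,answer=1

# A small, commonly used feature set:
useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
```

Training is then invoked along the lines of `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop`.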

Roaster answered 23/12, 2015 at 10:16 Comment(3)
Thank you, Rohan. An ML-based approach can be valuable, but clearly some rules will give us a lot of value here. There are too many regexes for an ad hoc solution without CoreNLP to be simple or performant; I would like an entity-extraction tool that lets me bundle these regexes in a way that keeps things simple and performant.Pregnable
Yes, it is possible. It requires a lot of research for this problem. :)Roaster
