Strategy for parsing natural language descriptions into structured data

Asked 7/10, 2011 at 22:30 Answered 12/9, 2018 at 21:48

I have a set of requirements and I'm looking for the best Java-based strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse out the metadata into a structured format (see the requirements below for what I'm trying to do).

I've looked around here and in other places, but have found nothing that gives high-level advice on which direction to follow. So, I'll put it to the smart people :-):

What's the best / simplest way to solve this problem? Should I use a natural language parser, a DSL, Lucene/Solr, or some other tool/technology? NLP seems like it may work, but it looks really complex. I'd rather not spend a whole lot of time doing a deep dive just to find out it can't do what I'm looking for or that there is a simpler solution.

Requirements

Given these recipe ingredient descriptions....

"8 cups of mixed greens (about 5 ounces)"
"Eight skinless chicken thighs (about 1¼ lbs)"
"6.5 tablespoons extra-virgin olive oil"
"approximately 6 oz. thinly sliced smoked salmon, cut into strips"
"2 whole chickens (3 .5 pounds each)"
"20 oz each frozen chopped spinach, thawed"
".5 cup parmesan cheese, grated"
"about .5 cup pecans, toasted and finely ground"
".5 cup Dixie Diner Bread Crumb Mix, plain"
"8 garlic cloves, minced (4 tsp)"
"8 green onions, cut into 2 pieces"

I want to turn it into this....

|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
|     | Measure |             |                         | weight | weight    |                                |             |
| #   | value   | Measure     | ingredient              | value  | measure   | preparation                    | Brand Name  |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
| 1.  | 8       | cups        | mixed greens            | 5      | ounces    | -                              | -           |
| 2.  | 8       | -           | skinless chicken thigh  | 1.25   | pounds    | -                              | -           |
| 3.  | 6.5     | tablespoons | extra-virgin olive oil  | -      | -         | -                              | -           |
| 4.  | 6       | ounces      | smoked salmon           | -      | -         | thinly sliced, cut into strips | -           |
| 5.  | 2       | -           | whole chicken           | 3.5    | pounds    | -                              | -           |
| 6.  | 20      | ounces      | frozen chopped spinach  | -      | -         | thawed                         | -           |
| 7.  | .5      | cup         | parmesan cheese         | -      | -         | grated                         | -           |
| 8.  | .5      | cup         | pecans                  | -      | -         | toasted, finely ground         | -           |
| 9.  | .5      | cup         | Bread Crumb Mix, plain  | -      | -         | -                              | Dixie Diner |
| 10. | 8       | -           | garlic clove            | 4      | teaspoons | minced                         | -           |
| 11. | 8       | -           | green onions            | -      | -         | cut into 2 pieces              | -           |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|

Note the diversity of the descriptions. Some things are abbreviated, some are not. Some numbers are numbers, some are spelled out.

I would love something that does a perfect parse/translation, but I would settle for something that does reasonably well to start.

Bonus question: after suggesting a strategy / tool, how would you go about it?

Thanks!

Joe

Hodgkin answered 7/10, 2011 at 22:30 Comment(2)

Thanks, everyone, for the answers. Sounds like the majority is recommending a Natural Language Parsing approach. I don't mind jumping in, but some of the terminology is a bit daunting, e.g. "tagged corpus for training a statistical model". I'm new to all this. I'll start with Stanford NLP, since that seems to be the most popular/recommended (nlp.stanford.edu/software/lex-parser.shtml). If anyone has any good places to start on NLP concepts, please let me know. Thanks! – Hodgkin 10/10, 2011 at 18:22

12 years later: Use ChatGPT or Bard. – Siqueiros 5/8, 2023 at 19:50

Short answer. Use GATE.

Long answer. You need a tool for pattern recognition in text, something that can catch patterns like:

{Number}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}{"("}{Value}{")"}
...

Where {Number} is a number, {Ingredient} is taken from a dictionary of ingredients, {Measure} from a dictionary of measures, and so on.

The patterns I described are very similar to GATE's JAPE rules. With them you catch text that matches a pattern and assign a label to each part of the pattern (number, ingredient, measure, etc.). Then you extract the labeled text and put it into a single table.

The dictionaries I mentioned can be represented by Gazetteers in GATE.

So GATE covers all your needs. It's not the easiest way to start, since you will have to learn at least GATE's basics, JAPE rules, and Gazetteers, but with this approach you will be able to get really good results.
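To make the pattern idea concrete without GATE, here is a minimal sketch in plain Java using `java.util.regex` named groups. It is not JAPE and does not use any GATE API; the class name, the tiny word lists standing in for gazetteers, and the single combined pattern are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-in for pattern rules plus dictionaries; not GATE code.
class IngredientPatternSketch {
    // Tiny illustrative dictionaries; a real system would load far larger lists.
    static final List<String> MEASURES =
        List.of("cups", "cup", "tablespoons", "ounces", "oz");
    static final List<String> INGREDIENTS =
        List.of("mixed greens", "extra-virgin olive oil", "garlic cloves");

    // Turns a word list into a regex alternation, quoting each entry literally.
    static String alternation(List<String> words) {
        return words.stream().map(Pattern::quote).collect(Collectors.joining("|"));
    }

    // {Number}{Space}[{Measure}{Space}]["of"{Space}]{Ingredient}["("{Value}")"]
    static final Pattern LINE = Pattern.compile(
        "^(?<number>\\d*\\.?\\d+)\\s+"
        + "(?:(?<measure>" + alternation(MEASURES) + ")\\s+)?"
        + "(?:of\\s+)?"
        + "(?<ingredient>" + alternation(INGREDIENTS) + ")"
        + "(?:\\s*\\((?<value>[^)]*)\\))?$",
        Pattern.CASE_INSENSITIVE);

    // Returns the labeled parts, or empty if the line does not fit the pattern.
    // Optional parts that are absent are stored as null.
    static Optional<Map<String, String>> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : List.of("number", "measure", "ingredient", "value")) {
            fields.put(name, m.group(name));
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        System.out.println(parse("8 cups of mixed greens (about 5 ounces)"));
        System.out.println(parse("6.5 tablespoons extra-virgin olive oil"));
    }
}
```

Lines that do not fit the pattern (for example ones with a trailing ", minced") come back empty here, which is where additional rules or larger dictionaries would have to take over.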

Wig answered 8/10, 2011 at 0:43 Comment(0)

It is basically natural language parsing. (You have already done some stemming: chicken[s].) So it is essentially a translation process. Fortunately, the context is very restricted.

You need a supportive translation process, where you can add dictionary entries, adapt the grammar rules, and retry.

An easy process/workflow is much more important here than the algorithms. I am interested in both aspects.

If you need a programming hand for an initial prototype, feel free to contact me. I see you are already working in quite a structured way.

Unfortunately, I do not know of any fitting frameworks. You are doing something like what the makers of Mathematica want to do with Wolfram Alpha (natural language commands yielding results). Data mining? But simple natural language parsing with a manual adaptation process should give fast and easy results.
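The "add dictionary entries and retry" loop can be sketched roughly as follows. This is only an illustration under my own assumptions: the class name `IngredientNormalizer`, its methods, and the starter entries are hypothetical, and it only rewrites single words (number words and unit abbreviations), not grammar.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites known words to a canonical form.
class IngredientNormalizer {
    // A word made of letters, optionally followed by an abbreviation dot.
    private static final Pattern WORD = Pattern.compile("(\\p{L}+)\\.?");

    private final Map<String, String> dictionary = new HashMap<>();

    IngredientNormalizer() {
        // Starter entries (assumed); extend them as new inputs show up.
        addEntry("eight", "8");
        addEntry("oz", "ounces");
        addEntry("lbs", "pounds");
    }

    // Adds or overrides a dictionary entry; keys are matched case-insensitively.
    void addEntry(String from, String to) {
        dictionary.put(from.toLowerCase(Locale.ROOT), to);
    }

    // Replaces every known word (and its trailing dot, if any);
    // unknown words are left exactly as they were.
    String normalize(String line) {
        Matcher m = WORD.matcher(line);
        return m.replaceAll(match -> {
            String canonical = dictionary.get(match.group(1).toLowerCase(Locale.ROOT));
            return Matcher.quoteReplacement(canonical != null ? canonical : match.group());
        });
    }

    public static void main(String[] args) {
        IngredientNormalizer normalizer = new IngredientNormalizer();
        String line = "8 garlic cloves, minced (4 tsp)";
        System.out.println(normalizer.normalize(line)); // "tsp" is not known yet
        normalizer.addEntry("tsp", "teaspoons");        // manual adaptation step
        System.out.println(normalizer.normalize(line)); // retry with the new entry
    }
}
```

The point of the design is the workflow: run the inputs, look at what was left untouched, add an entry, and run again.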

Mothering answered 8/10, 2011 at 0:11 Comment(2)

Hi Joop. Would love to work with you on this. I don't have the NLP background, but it seems to me that Stanford NLP (nlp.stanford.edu/software/lex-parser.shtml) would be a good framework to look at initially. Also, ffriend's recommendation of GATE looks promising. – Hodgkin 10/10, 2011 at 18:25

Sorry, I deliberately stayed away to let things settle, but now it's maybe a bit late. Stanford NLP looks fine. If I can get involved, you can reach me at joop underscore eggen at y a h oo de. – Mothering 18/11, 2011 at 11:24

You can also try Gexp. Then you have to write rules as a Java class, such as

seq(Number, opt(Measure), Ingredient, opt(seq(token("("), Number, Measure, token(")"))))

Then you have to add groups to capture parts of the pattern (group(String name, Matcher m)), extract them, and store this information in a table. For Number and Measure you should use similar Gexp patterns, or I would recommend some shallow parsing for noun phrase detection with words from Ingredients.

Brecciate answered 8/10, 2011 at 9:20 Comment(0)

If you don't want to be exposed to the nitty-gritty of NLP and machine learning, there are a few hosted services that do this for you:

Zestful (disclaimer: I'm the author)
Spoonacular
Edamam

If you are interested in the nitty-gritty, the New York Times wrote about how they parsed their ingredient archive. They open-sourced their code, but abandoned it soon after. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.

Capel answered 12/9, 2018 at 21:48 Comment(0)

Do you have access to a tagged corpus for training a statistical model? That is probably the most fruitful avenue here. You could build one up using epicurious.com: scrape a lot of their recipe ingredient lists, which are in the kind of prose form you need to parse, and then use their helpful "print a shopping list" feature, which provides the same ingredients in a tabular format. You can use this data to train a statistical language model, since you will have both the raw untagged data and the expected parse results for a large number of examples.

This might be a bigger project than you have in mind, but I think in the end it will produce better results than a structured top-down parsing approach will.
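As a rough idea of what "tagged" means here, the sketch below pairs one raw line with its expected tabular fields and labels each token accordingly. Everything in it is an assumption for illustration: the class name, the label names (QTY, UNIT, NAME), and the naive word-matching alignment. It only produces labeled examples; training a model on them is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper that turns (raw line, expected fields) into token labels.
class TrainingExampleBuilder {
    // Labels each whitespace-separated token with the first field whose value
    // contains that word, or "OTHER" when no field contains it.
    static List<String> label(String rawLine, Map<String, String> fields) {
        List<String> tagged = new ArrayList<>();
        for (String token : rawLine.trim().split("\\s+")) {
            // Ignore parentheses and commas when comparing against field values.
            String clean = token.replaceAll("[(),]", "").toLowerCase(Locale.ROOT);
            String label = "OTHER";
            if (!clean.isEmpty()) {
                for (Map.Entry<String, String> field : fields.entrySet()) {
                    List<String> words = Arrays.asList(
                        field.getValue().toLowerCase(Locale.ROOT).split("\\s+"));
                    if (words.contains(clean)) {
                        label = field.getKey();
                        break;
                    }
                }
            }
            tagged.add(token + "/" + label);
        }
        return tagged;
    }

    public static void main(String[] args) {
        // Expected parse for one line, as it might come from a tabular source.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("QTY", "8");
        fields.put("UNIT", "cups");
        fields.put("NAME", "mixed greens");
        System.out.println(label("8 cups of mixed greens", fields));
    }
}
```

A real alignment would need more care, for instance when the same number appears twice in a line or when the table uses a different word form than the prose.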

Abiding answered 8/10, 2011 at 3:42 Comment(2)

I think that recipes may in fact be so structured that a finite state/heuristic solution might be easier for this task. – Openandshut 8/10, 2011 at 16:16

Easier, yes, but I think you'd get better results with a good corpus. It's a nice constrained problem, though - doesn't get much easier in a natural language. – Abiding 8/10, 2011 at 17:37
