I have a set of requirements and I'm looking for the best Java-based strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse out the metadata into a structured format (see the requirements below for what I'm trying to do).
I've looked around here and in other places, but have found nothing that gives high-level advice on what direction to follow. So, I'll put it to the smart people :-):
What's the best / simplest way to solve this problem? Should I use a natural language parser, a DSL, Lucene/Solr, or some other tool/technology? NLP seems like it may work, but it looks really complex. I'd rather not spend a whole lot of time doing a deep dive just to find out it can't do what I'm looking for or that there is a simpler solution.
Requirements
Given these recipe ingredient descriptions:
- "8 cups of mixed greens (about 5 ounces)"
- "Eight skinless chicken thighs (about 1¼ lbs)"
- "6.5 tablespoons extra-virgin olive oil"
- "approximately 6 oz. thinly sliced smoked salmon, cut into strips"
- "2 whole chickens (3 .5 pounds each)"
- "20 oz each frozen chopped spinach, thawed"
- ".5 cup parmesan cheese, grated"
- "about .5 cup pecans, toasted and finely ground"
- ".5 cup Dixie Diner Bread Crumb Mix, plain"
- "8 garlic cloves, minced (4 tsp)"
- "8 green onions, cut into 2 pieces"
I want to turn them into this:
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
|     | Measure |             |                         | weight | weight    |                                |             |
| #   | value   | Measure     | ingredient              | value  | measure   | preparation                    | Brand Name  |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
| 1.  | 8       | cups        | mixed greens            | 5      | ounces    | -                              | -           |
| 2.  | 8       | -           | skinless chicken thigh  | 1.25   | pounds    | -                              | -           |
| 3.  | 6.5     | tablespoons | extra-virgin olive oil  | -      | -         | -                              | -           |
| 4.  | 6       | ounces      | smoked salmon           | -      | -         | thinly sliced, cut into strips | -           |
| 5.  | 2       | -           | whole chicken           | 3.5    | pounds    | -                              | -           |
| 6.  | 20      | ounces      | frozen chopped spinach  | -      | -         | thawed                         | -           |
| 7.  | .5      | cup         | parmesan cheese         | -      | -         | grated                         | -           |
| 8.  | .5      | cup         | pecans                  | -      | -         | toasted, finely ground         | -           |
| 9.  | .5      | cup         | Bread Crumb Mix, plain  | -      | -         | -                              | Dixie Diner |
| 10. | 8       | -           | garlic clove            | 4      | teaspoons | minced                         | -           |
| 11. | 8       | -           | green onions            | -      | -         | cut into 2 pieces              | -           |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
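In Java terms, I picture each row of that table ending up in something like the class below. This is only a sketch of the target shape: the name `ParsedIngredient` and all of its field names are hypothetical placeholders I made up, and a `null` field stands for a "-" cell.

```java
import java.math.BigDecimal;

// Hypothetical target type: one instance per row of the desired output.
// A null field means the description did not contain that piece ("-").
final class ParsedIngredient {
    final BigDecimal measureValue;  // e.g. 8, 6.5, .5
    final String measure;           // e.g. "cups"; null for counted items
    final String ingredient;        // e.g. "mixed greens"
    final BigDecimal weightValue;   // e.g. 5; null if no weight was given
    final String weightMeasure;     // e.g. "ounces"
    final String preparation;       // e.g. "thinly sliced, cut into strips"
    final String brandName;         // e.g. "Dixie Diner"

    ParsedIngredient(BigDecimal measureValue, String measure, String ingredient,
                     BigDecimal weightValue, String weightMeasure,
                     String preparation, String brandName) {
        this.measureValue = measureValue;
        this.measure = measure;
        this.ingredient = ingredient;
        this.weightValue = weightValue;
        this.weightMeasure = weightMeasure;
        this.preparation = preparation;
        this.brandName = brandName;
    }

    // Renders null fields as "-" so the output resembles a table row.
    private static String cell(Object value) {
        return value == null ? "-" : value.toString();
    }

    @Override
    public String toString() {
        return String.join(" | ",
                cell(measureValue), cell(measure), cell(ingredient),
                cell(weightValue), cell(weightMeasure),
                cell(preparation), cell(brandName));
    }
}
```

For example, row 4 would be `new ParsedIngredient(new BigDecimal("6"), "ounces", "smoked salmon", null, null, "thinly sliced, cut into strips", null)`.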
Note the diversity of the descriptions. Some things are abbreviated, some are not. Some numbers are numbers, some are spelled out.
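To make that concrete, here is a rough sketch of the kind of token-level normalization I imagine any solution would need before (or as part of) the real parsing. It only handles single tokens, and it says nothing about how to split a description into tokens in the first place. `TokenNormalizer`, its method names and the lookup tables are all hypothetical names of my own, not from any library, and the tables are deliberately incomplete.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps messy quantity and unit tokens to canonical forms.
final class TokenNormalizer {
    private static final Map<String, String> UNITS = new HashMap<>();
    private static final Map<String, BigDecimal> NUMBER_WORDS = new HashMap<>();
    private static final Map<Character, BigDecimal> FRACTIONS = new HashMap<>();

    static {
        // Assumed, incomplete tables; a real solution would need many more entries.
        UNITS.put("oz", "ounces");
        UNITS.put("ounces", "ounces");
        UNITS.put("lb", "pounds");
        UNITS.put("lbs", "pounds");
        UNITS.put("pounds", "pounds");
        UNITS.put("tsp", "teaspoons");
        UNITS.put("tablespoons", "tablespoons");
        UNITS.put("cup", "cup");
        UNITS.put("cups", "cups");

        NUMBER_WORDS.put("two", new BigDecimal("2"));
        NUMBER_WORDS.put("six", new BigDecimal("6"));
        NUMBER_WORDS.put("eight", new BigDecimal("8"));

        FRACTIONS.put('\u00BC', new BigDecimal("0.25")); // one-quarter glyph
        FRACTIONS.put('\u00BD', new BigDecimal("0.5"));  // one-half glyph
        FRACTIONS.put('\u00BE', new BigDecimal("0.75")); // three-quarters glyph
    }

    // Returns the canonical unit name, or null if the token is not a known unit.
    static String normalizeUnit(String token) {
        String key = token.trim().toLowerCase(Locale.ROOT).replaceAll("\\.$", "");
        return UNITS.get(key);
    }

    // Returns the numeric value of a quantity token, or null if it is not one.
    static BigDecimal parseQuantity(String token) {
        String t = token.trim().toLowerCase(Locale.ROOT);
        if (t.isEmpty()) {
            return null;
        }
        BigDecimal word = NUMBER_WORDS.get(t);
        if (word != null) {
            return word;                        // spelled-out number
        }
        BigDecimal fraction = FRACTIONS.get(t.charAt(t.length() - 1));
        if (fraction != null) {
            String whole = t.substring(0, t.length() - 1);
            if (whole.isEmpty()) {
                return fraction;                // bare fraction glyph
            }
            return whole.matches("\\d+") ? new BigDecimal(whole).add(fraction) : null;
        }
        // Plain decimals, including a leading-dot form such as ".5".
        return t.matches("\\d*\\.?\\d+") ? new BigDecimal(t) : null;
    }
}
```

With these tables, `parseQuantity("Eight")` and `parseQuantity("8")` both yield 8, and `normalizeUnit("oz.")` and `normalizeUnit("ounces")` both yield "ounces". The hard part I'm asking about is everything around this: deciding which words are the ingredient, the preparation, or the brand name.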
I would love something that does a perfect parse/translation, but I would settle for something that does reasonably well to start.
Bonus question: after suggesting a strategy / tool, how would you go about it?
Thanks!
Joe