Using integers/dates as terminals in NLTK parser

I'm trying to use the Earley parser in NLTK to parse sentences such as:

If date is before 12/21/2010 then serial = 10

To do this, I'm trying to write a CFG, but the problem is that I need dates and integers as terminals in a general format rather than as specific values. Is there any way to specify the right-hand side of a production rule as a regular expression, which would allow this kind of processing?

Something like:

S -> '[0-9]+'

which would handle all integers.

Hazelhazelnut answered 10/11, 2010 at 19:17 Comment(2)
Your date format is locale-dependent. More importantly, it is ambiguous: it collides with a mathematical expression (12 div 21 div 2010), which is probably not what you want. - Jacobo
You're right, but that will be easy to handle since the input will never contain mathematical expressions like the one you mentioned. Also, the date format will be fixed, say, MM/DD/YYYY. I found a way to handle integers, but I'm still looking for a proper solution for dates. - Hazelhazelnut

For this to work, you'll need to tokenize the date so that each digit and slash is a separate token.

import nltk
from nltk.parse.earleychart import EarleyChartParser

# nltk.parse_cfg was removed in NLTK 3; CFG.fromstring replaces it.
grammar = nltk.CFG.fromstring("""
DATE -> MONTH SEP DAY SEP YEAR
SEP -> "/"
MONTH -> DIGIT | DIGIT DIGIT
DAY -> DIGIT | DIGIT DIGIT
YEAR -> DIGIT DIGIT DIGIT DIGIT
DIGIT -> '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '0'
""")

parser = EarleyChartParser(grammar)
# In NLTK 3, parse() returns an iterator over all parse trees.
for tree in parser.parse(["1", "/", "1", "0", "/", "1", "9", "8", "7"]):
    print(tree)

The output is:

(DATE
  (MONTH (DIGIT 1))
  (SEP /)
  (DAY (DIGIT 1) (DIGIT 0))
  (SEP /)
  (YEAR (DIGIT 1) (DIGIT 9) (DIGIT 8) (DIGIT 7)))

This also gives some flexibility, since days and months are allowed to be single-digit.
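The tokenization step mentioned above (each digit and slash as a separate token) could be sketched roughly as follows. This is only an illustration: `tokenize` and `DATE_RE` are hypothetical names, not part of NLTK, and the sketch assumes whitespace-separated input with dates in the fixed MM/DD/YYYY form the asker describes.

```python
import re

# Hypothetical pattern (an assumption, not an NLTK feature):
# one or two digits, slash, one or two digits, slash, four digits.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

def tokenize(sentence):
    """Split on whitespace, but break date-like words into single characters."""
    tokens = []
    for word in sentence.split():
        if DATE_RE.fullmatch(word):
            # One token per digit or slash, as the DATE grammar expects.
            tokens.extend(word)
        else:
            tokens.append(word)
    return tokens

date_tokens = tokenize("1/10/1987")
sentence_tokens = tokenize("If date is before 12/21/2010 then serial = 10")
print(date_tokens)
print(sentence_tokens)
```

The first list has the same shape as the token list passed to `parser.parse` in the answer. Note that the trailing integer `10` stays a single token, so the grammar would still need a separate way to handle integers.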

Bowerman answered 4/6, 2011 at 19:43 Comment(0)
