Detect dates in spaCy
M

2

9

Is there a way to write a rule-based system to catch things like start/end dates in contract text? Here are a few real examples. In each one, the entities I want spaCy to detect automatically are the start and end dates. If you have ideas other than spaCy, that is also OK!

  1. The initial term of this Lease shall be for a period of Five (5) years commencing on February 1, 2012, (the “Lease Commencement Date”) and expiring on January 31, 2017 (the “Initial Lease Term”).

  2. Term: One (1) year commencing January 1, 2007 ("Commencement Date") and ending December 31, 2007 ("Expiration Date").

  3. This Lease Agreement is entered into for term of 15 years, beginning January 1, 2014 and ending on December 31, 2028.

Maynard answered 15/12, 2019 at 13:25 Comment(6)
Dates can be super complicated. Can you be certain that you will only be looking for dates in the format MonthName dayNum, 4DigitYear? – Melancholy
No guarantee what format it will be in. Could be MONTH, DAY, YEAR, or MM/DD/YYYY for example. – Maynard
That makes it more difficult. Could it also be DD/MM/YYYY or DD/MM/YY, or YYYY/MM/DD, or YY/MM/DD? This is why dates are complicated in programming. – Melancholy
Oh, I actually wasn't worried about this detail, because one could just submit the date to dateutil.parser and see if it is recognized... – Maynard
But you still need to recognize it as a date. You can't do that without knowing all the formats that a date could be in. – Melancholy
I know spaCy can recognize it as a date. I just want to subselect those dates which are start/end dates. – Maynard
B
6

I think you have to make a clear distinction between two types of methods:

1) Statistical models / Machine Learning, a.k.a. NER models. These take the context of the sentence into account when deciding whether a specific token, or several consecutive tokens, form a date. spaCy has pre-built NER models you can download and try out on your specific data. You'll want to look for those entities (in doc.ents) that have ent.label_ == "DATE". Once you have those entities, you can run them through a date parser to work out what the actual date is. See also here for more information.

2) Rule-based entity recognition. Here, you have to define the rules yourself by specifying what you expect your dates to look like, e.g. XX/XX/XXXX with X being a digit. As user1558604 pointed out though, you'll have to write several different rules if you want to recognize different representations of dates. You can find an overview of spaCy's rule-based matching methods here.
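
To make the rule-based idea concrete for the start/end question, here is a minimal sketch using only the Python standard library rather than spaCy. It assumes dates are written as "MonthName day, year" (as in the three examples in the question) and that an English locale is in effect for month names. The cue words, the `classify_dates` name, and the 40-character window are all hypothetical choices for illustration, not anything spaCy provides. The same logic could be applied to spaCy's DATE entities instead of regex matches.

```python
import re
from datetime import datetime

# Assumption: dates look like "February 1, 2012" (full English month name).
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")

# Hypothetical cue words; extend this mapping for your own contracts.
ROLE_BY_CUE = {
    "commencing": "start", "beginning": "start", "starting": "start",
    "expiring": "end", "ending": "end", "terminating": "end",
}
CUE_RE = re.compile(r"\b(" + "|".join(ROLE_BY_CUE) + r")\b", re.IGNORECASE)


def classify_dates(text, window=40):
    """Return (role, date) pairs; role is 'start', 'end', or None."""
    results = []
    for match in DATE_RE.finditer(text):
        # Only inspect the characters immediately before the date.
        context = text[max(0, match.start() - window):match.start()]
        cues = CUE_RE.findall(context)
        # The cue closest to the date (the last one found) decides the role.
        role = ROLE_BY_CUE[cues[-1].lower()] if cues else None
        parsed = datetime.strptime(match.group(), "%B %d, %Y").date()
        results.append((role, parsed))
    return results


if __name__ == "__main__":
    example = ("This Lease Agreement is entered into for term of 15 years, "
               "beginning January 1, 2014 and ending on December 31, 2028.")
    for role, found in classify_dates(example):
        print(role, found.isoformat())
```

Dates with no cue word nearby come back with a role of None, which makes it easy to see which sentences the rules do not yet cover.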

Bibliogony answered 15/12, 2019 at 19:37 Comment(5)
Thanks! Right now, we have a set of rules that select the start and end dates from all the spaCy-recognized dates. We want to build a more sophisticated rule-based approach before moving to machine learning, though. A few reasons for this: 1) we will establish a baseline accuracy/recall threshold against which we can compare future statistical models; 2) we will discover more about the problem and better understand its subtleties; 3) we can use the rule-based approach to help label data efficiently for future training. Maybe we should use the parsing tool? Love to hear your thoughts. Thanks! – Maynard
OK, so if I understand you correctly, you are already using the NER models in spaCy, and you want the rules to look at the surrounding sentence and extract begin/end clues? – Bibliogony
I think using the parser could definitely help you. Actually, spaCy has a currently experimental and undocumented DependencyMatcher that could be useful to you. See also #57664764 and github.com/explosion/spaCy/issues/4433 – Bibliogony
spaCy works... I tried it myself. Will update this thread once my POC project is ready, for future reference. – Beef
github.com/anshumankmr/rasa-chat-bot – Beef
W
-1

You can use SUTime from CoreNLP to do it easily: https://github.com/FraBle/python-sutime

Woody answered 15/12, 2019 at 13:39 Comment(2)
Because I do not know how to use that software. Is it even in Python, or just Java? – Maynard
This library is a Python wrapper on top of the original Java implementation, so you can use it from Python. If you follow the link in my answer, you will find the installation instructions and sample code for it. – Woody
