How to retrieve all kinds of dates and temporal values from text
O

5

5

I want to retrieve dates and other temporal entities from a set of strings. Can this be done in Java without parsing the string for dates? Most parsers handle only a limited set of input patterns, and the input here was entered manually, so it is ambiguous.

Inputs can be like:

12th Sep |mid-March |12.September.2013

Sep 12th |12th September| 2013

Sept 13 |12th, September |12th,Feb,2013

I've gone through many answers on finding dates in Java, but most of them don't deal with such a wide range of input patterns.

I've tried the SimpleDateFormat class, calling parse() and treating a failed parse as a sign that the string is not a date. I've tried regex, but I'm not sure it is a good fit for this scenario. I've also used ClearNLP to annotate the dates, but it doesn't give a reliable annotation set.

The closest approach so far could be a chain of responsibility, as mentioned in an answer below. Is there a library that ships a set of date patterns that I could use?

Orchardman answered 13/10, 2015 at 9:7 Comment(6)
Somewhere you have to limit the scope; try to wrap the input into your own fixed format. Kayser
@ankur-singhal Too late for that, buddy. I can't change this old data now; it's already there and I'm only extracting. Orchardman
Can you provide some information on what dates the lines should equate to? mid-March is a bit too ambiguous even for human processing. Wherefrom
mid-March is not a date. Backstage
@Orchardman To be straightforward, nothing extracts this data out of the box; it seems more like a data analytics problem. You might need to analyse the complete data and then come up with a solution based on keywords like Sep, September, th, nd, rd, etc. Kayser
True, I'll correct that @peter.petrov. I needed all temporal entities, not necessarily dates. Yes ankur-singhal, you are correct on that. Months, one- to four-digit numbers, and prepositions like "in, on, by, of" are the only tangibles I can find. Orchardman
O
1

Yes! I've finally extracted all sorts of dates and temporal values, which can be as generic as:

mid-March | Last Month | 9/11

To as specific as:

11/11/11 11:11:11

This finally worked thanks to GATE and its JAPE rule language.

I created a more lenient annotation rule in JAPE, called 'DateEnhanced', to include certain kinds of dates such as "9/11" or "11TH, February- 2001", and used a chain of Java regexes on the R.H.S. of the 'DateEnhanced' JAPE rule to filter out some unwanted outputs.
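The two-step idea above (match leniently, then filter) can be illustrated in plain Java without GATE. This is only a sketch under my own assumptions, not the actual 'DateEnhanced' JAPE rule: the class name LenientDateFinder, the candidate regex, and the plausibility filter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it is NOT the JAPE grammar from this answer.
class LenientDateFinder {
    // Month names and common abbreviations (Sep, Sept, September, ...).
    private static final String MONTH =
        "(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
        + "|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)";
    private static final String ORDINAL = "(?:st|nd|rd|th)?";
    private static final String SEP = "[\\s,.-]*";

    // Step 1: a deliberately permissive pattern for candidate expressions.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b(?:mid-" + MONTH
        + "|\\d{1,2}" + ORDINAL + SEP + MONTH + "(?:" + SEP + "\\d{4})?"
        + "|" + MONTH + SEP + "\\d{1,2}" + ORDINAL
        + "|\\d{1,2}[/.-]\\d{1,2}(?:[/.-]\\d{2,4})?)\\b",
        Pattern.CASE_INSENSITIVE);

    // Step 2: a stricter regex used only to vet purely numeric candidates.
    private static final Pattern NUMERIC =
        Pattern.compile("(\\d{1,2})[/.-](\\d{1,2})(?:[/.-]\\d{2,4})?");

    // Rejects numeric candidates such as "45/67" that cannot be day/month pairs.
    static boolean plausible(String candidate) {
        Matcher m = NUMERIC.matcher(candidate);
        if (!m.matches()) {
            return true; // contains a month name, keep it
        }
        int a = Integer.parseInt(m.group(1));
        int b = Integer.parseInt(m.group(2));
        return a >= 1 && b >= 1 && a <= 31 && b <= 31 && (a <= 12 || b <= 12);
    }

    // Returns every candidate in the text that survives the filter.
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            if (plausible(m.group())) {
                hits.add(m.group());
            }
        }
        return hits;
    }
}
```

Entities such as "Last Month" are not covered by this sketch; they would need an extra alternative or a gazetteer-style word list.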

Orchardman answered 15/10, 2015 at 13:35 Comment(0)
S
2

A clean and modular approach to this problem is to use a chain. Every element of the chain tries to match the input string against a regex. If the regex matches, the element converts the input string into something that can be fed to a SimpleDateFormat, converts it to the data structure you prefer (Date, or a different temporal representation that better suits your needs), and returns it. If the regex doesn't match, the element just delegates to the next element in the chain.

The responsibility of every element of the chain is just to test the regex against the string, then either give a result or ask the next element of the chain to give it a try.

The chain can be created and composed easily without having to change the implementation of any element of the chain.

In the end the result is the same as in @KirkoR's response, with a 'bit' (:D) more code but a modular approach. (I prefer the regex approach to the try/catch one.)

Some reference: https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern
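A minimal sketch of such a chain might look like the following. It is an illustration under assumptions, not a complete solution: the names DateLink and DateChain, and the two regexes, are made up for this example. It also uses java.time (LocalDate, DateTimeFormatter) in place of the SimpleDateFormat mentioned above.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One element of the chain: a regex plus a converter for strings that match it.
// (Hypothetical class, introduced only for this illustration.)
class DateLink {
    private final Pattern pattern;
    private final Function<Matcher, LocalDate> converter;
    private DateLink next;

    DateLink(String regex, Function<Matcher, LocalDate> converter) {
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        this.converter = converter;
    }

    // Appends a link and returns it, so calls can be chained.
    DateLink then(DateLink nextLink) {
        this.next = nextLink;
        return nextLink;
    }

    // Tries this link first and delegates to the next one when it cannot answer.
    Optional<LocalDate> handle(String input) {
        Matcher m = pattern.matcher(input.trim());
        if (m.matches()) {
            try {
                return Optional.of(converter.apply(m));
            } catch (DateTimeException e) {
                // The shape matched but the value is not a real date; fall through.
            }
        }
        return next == null ? Optional.empty() : next.handle(input);
    }
}

class DateChain {
    private static DateTimeFormatter english(String pattern) {
        return new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern(pattern)
            .toFormatter(Locale.ENGLISH);
    }

    private static final DateTimeFormatter FULL = english("d MMMM uuuu");
    private static final DateTimeFormatter SHORT = english("d MMM uuuu");

    // Composes the chain; adding a format means adding a link, not editing old ones.
    static DateLink build() {
        // Handles inputs shaped like 12.September.2013
        DateLink head = new DateLink("(\\d{1,2})\\.([a-z]+)\\.(\\d{4})",
            m -> LocalDate.parse(m.group(1) + " " + m.group(2) + " " + m.group(3), FULL));
        // Handles inputs shaped like 12th,Feb,2013
        head.then(new DateLink("(\\d{1,2})(?:st|nd|rd|th),\\s*([a-z]{3}),\\s*(\\d{4})",
            m -> LocalDate.parse(m.group(1) + " " + m.group(2) + " " + m.group(3), SHORT)));
        return head;
    }
}
```

Calling DateChain.build().handle("12.September.2013") yields an Optional holding 2013-09-12, while an input no link recognises, such as "mid-March", comes back as an empty Optional.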

Soukup answered 13/10, 2015 at 9:40 Comment(2)
This is very relevant. This approach can be very helpful. +1 Thanks. Orchardman
Finally made it. GATE is awesome! Orchardman
B
1

You could just implement support for all the pattern possibilities you can think of, then document them: these are all the patterns my module supports. You could then throw a RuntimeException for every other input.

Then, iteratively, keep running your module over the input data and keep adding support for more date formats until it stops raising RuntimeExceptions.

I think that's the best you can do here if you want to keep it reasonably simple.
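As a rough sketch of that loop, assuming java.time formatters stand in for "supported patterns": the class SupportedPatterns, the exception UnsupportedDateFormatException, and the two starting patterns are hypothetical names chosen for this example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical exception type, signalling an input that no pattern supports yet.
class UnsupportedDateFormatException extends RuntimeException {
    UnsupportedDateFormatException(String input) {
        super("No supported pattern for: " + input);
    }
}

class SupportedPatterns {
    // The documented list of supported patterns; extend it on each iteration.
    static final List<DateTimeFormatter> FORMATS = List.of(
        DateTimeFormatter.ofPattern("d.MMMM.uuuu", Locale.ENGLISH),
        DateTimeFormatter.ofPattern("d/M/uuuu", Locale.ENGLISH));

    // Parses with the first pattern that fits, otherwise throws.
    static LocalDate parse(String s) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return LocalDate.parse(s.trim(), format);
            } catch (DateTimeParseException e) {
                // Not this pattern; try the next one.
            }
        }
        throw new UnsupportedDateFormatException(s);
    }

    // Runs over the data and collects the inputs that still need a pattern.
    static List<String> unsupported(List<String> inputs) {
        List<String> failures = new ArrayList<>();
        for (String input : inputs) {
            try {
                parse(input);
            } catch (UnsupportedDateFormatException e) {
                failures.add(input);
            }
        }
        return failures;
    }
}
```

Running unsupported(...) over the data gives the list of strings to study before the next round; once it comes back empty, the pattern list covers the whole data set.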

Backstage answered 13/10, 2015 at 9:14 Comment(1)
Is there a library that stores such temporal patterns? Some kind of gazetteer other than ClearNLP? Orchardman
D
0

I can recommend a very nice implementation for your problem, unfortunately in Polish: http://koziolekweb.pl/2015/04/15/throw-to-taki-inny-return/

You can use Google Translate:

https://translate.google.pl/translate?sl=pl&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fkoziolekweb.pl%2F2015%2F04%2F15%2Fthrow-to-taki-inny-return&edit-text=

The code there looks really nice:

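// The pattern constants (YYYY_MM_DD, ...) are not shown here; Lists is presumably Guava's.
private static Date convertStringToDate(String s) {
    if (s == null || s.trim().isEmpty()) return null;
    ArrayList<String> patterns = Lists.newArrayList(
            YYYY_MM_DD_T_HH_MM_SS_SSS,
            YYYY_MM_DD_T_HH_MM_SS,
            YYYY_MM_DD_T_HH_MM,
            YYYY_MM_DD);
    for (String pattern : patterns) {
        try {
            return new SimpleDateFormat(pattern).parse(s);
        } catch (ParseException e) {
            // Not this pattern; try the next one.
        }
    }
    // No pattern matched: treat the input as epoch milliseconds
    // (throws NumberFormatException if it is not numeric).
    return new Date(Long.valueOf(s));
}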
Dissimilate answered 13/10, 2015 at 9:27 Comment(2)
Seems OP needs not just dates but all "temporal entities", a concept that needs a stricter definition IMHO. Backstage
Can DateFormat match "12th, September"? Maybe I can write a more complicated method using peter's suggestion with a divide-and-conquer approach. Orchardman
B
0
    mark.util.DateParser dp = new DateParser();
    ParsePositionEx parsePosition = new ParsePositionEx(0);
    Date startDate = dp.parse("12.September.2013", parsePosition);
    System.out.println(startDate);

output: Thu Sep 12 17:18:18 IST 2013

mark.util.DateParser is part of a library used by the DateNormalizer PR, so in a JAPE file we just have to import it.

Bestraddle answered 20/1, 2016 at 11:52 Comment(2)
This can be used to normalize dates. You can avoid this if you only want to extract dates. Orchardman
Sorry, I overlooked your question saying "without parsing". Bestraddle

© 2022 - 2024 — McMap. All rights reserved.