Best way to identify and extract dates from text in Python?

Asked 15/11, 2013 at 5:50 Answered 25/10, 2023 at 14:40

As part of a larger personal project I'm working on, I'm attempting to separate out inline dates from a variety of text sources.

For example, I have a large list of strings (that usually take the form of English sentences or statements) that take a variety of forms:

Central design committee session Tuesday 10/22 6:30 pm

Th 9/19 LAB: Serial encoding (Section 2.2)

There will be another one on December 15th for those who are unable to make it today.

Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm

He will be flying in Sept. 15th.

While these dates are in-line with natural text, none of them are in specifically natural language forms themselves (e.g., there's no "The meeting will be two weeks from tomorrow"—it's all explicit).

As someone who doesn't have too much experience with this kind of processing, what would be the best place to begin? I've looked into things like the dateutil.parser module and parsedatetime, but those seem to be for after you've isolated the date.

Because of this, is there any good way to extract both the date and the extraneous text, e.g.

input:  Th 9/19 LAB: Serial encoding (Section 2.2)
output: ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']

or something similar? It seems like this sort of processing is done by applications like Gmail and Apple Mail, but is it possible to implement in Python?

Oldest answered 15/11, 2013 at 5:50 Comment(4)

@Kyle Kelley : Have you tried python regex? – Velocity 15/11, 2013 at 7:31

@NilaniAlgiriyage Most certainly. However, I'd much rather use someone's battle-tested libraries first before rolling my own regex. We could clearly write one for the cases outlined above, then update it with more cases and more logic. – Euphemie 15/11, 2013 at 13:12

For the sake of humanity though, it makes more sense to contribute upstream to an open source project. There may even be regular expressions in it. :P – Euphemie 15/11, 2013 at 13:14

You can take a look at datefinder. – Farmstead 24/9, 2021 at 4:54

I was also looking for a solution to this and couldn't find any, so a friend and I built a tool to do this. I thought I would come back and share in case others find it helpful.

datefinder -- find and extract dates inside text

Here's an example:

import datefinder

string_with_dates = '''
    Central design committee session Tuesday 10/22 6:30 pm
    Th 9/19 LAB: Serial encoding (Section 2.2)
    There will be another one on December 15th for those who are unable to make it today.
    Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
    He will be flying in Sept. 15th.
    We expect to deliver this between late 2021 and early 2022.
'''

matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)

Onstad answered 28/1, 2016 at 18:20 Comment(15)

Really great tool, it accounts for a large variety of cases! Great work. – Steric 11/5, 2016 at 0:9

It is strange, but this is what is happening. I am doing string_with_dates = """ ... entries are due by January 4th, 2017 at 8:00pm ... created 01/15/2005 by ACME Inc. and associates.""" followed by matches = datefinder.find_dates(string_with_dates, source=True). The first time I access the matches generator I am able to print the results; the next time I can't. I am using Python 3.4 with the syntax: for m in matches: print(m) – Kelsi 26/12, 2016 at 23:22

Is it possible to get the original matched strings? Example: for "entries are due by January 4th, 2017 at 8:00pm", running matches = datefinder.find_dates(text) would return ['January 4th, 2017 at 8:00pm'] – Octarchy 5/4, 2017 at 11:29

@SomnathKadam Yes: github.com/akoumjian/datefinder/blob/master/datefinder.py#L284 – Onstad 9/4, 2017 at 17:58

@Onstad I'm trying to install your package and I get this error: failed building wheel for regex. And then it says: running setup.py clean for regex and eventually a lot of other errors ending with "Rolling back uninstall of regex" and a long red line. Have other people reported installation errors like this? I appreciate any recommendations. Thanks – Azelea 22/4, 2018 at 23:14

@Onstad thanks for such a great tool! But I want to know whether we can get the position or index of date token in the actual text? – Hilversum 28/11, 2018 at 7:30

For a string like -039_apple the parser returns 2039-12-12 00:00:00. Is there any reason why it does so? The same happens with other hyphenation/punctuation present in the text. – Moschatel 12/12, 2018 at 9:38

@Hilversum Please see the documentation that's part of the basic api: datefinder.readthedocs.io/en/latest – Onstad 29/12, 2018 at 2:1

@Azelea that should be fixed now – Onstad 29/12, 2018 at 2:1

@Moschatel The capture model for datefinder is going to create some false positives. Essentially, it casts a wide net over possible date interpretations, sends a possible match to dateutil, and then returns whatever dateutil says it is. – Onstad 29/12, 2018 at 2:4

great package, thanks! Is it possible to add timezone recognition, or am I asking for too much? :) – Katerinekates 9/4, 2020 at 2:4

I don't know why, but it can't detect the date in my string '''other text Notice :Download Procedure: Open Closing date: 14/04/2021''' – Willdon 11/4, 2021 at 4:59

Great work. It solves the problem to a good extent. – Darladarlan 7/5, 2021 at 10:20

It's not detecting any type of date. For example, I tried on "This is sample date 10-10-2021" and the library was unable to extract the date from it. – Arther 15/7, 2021 at 17:34

For anyone serious about parsing everything, there are a few caveats with datefinder. It omits the following formats from the Oracle standard format library: #B, #C, #D (MMDDYY, DDMMYY and YYMMDD), #G (YYMonDD) and some variations of it, and some unusual ones such as (day/YY). – Tapis 14/11, 2022 at 2:23
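
The capture model Onstad describes in the comments above (cast a wide net, hand each candidate to a parser, return whatever parses) can be illustrated with a small sketch. This is not datefinder's code: the pattern is deliberately loose, `datetime.strptime` stands in for dateutil, and the names `loose_parse` and `find_candidates` as well as the default year are assumptions made for the example.

```python
import re
from datetime import datetime

# Deliberately loose: one or two digits, a separator, one or two digits,
# and optionally a separator plus a four-digit year.
CANDIDATE = re.compile(r"\d{1,2}[-/.]\d{1,2}(?:[-/.]\d{4})?")


def loose_parse(candidate, default_year):
    """Read a candidate as month/day[/year]; return None if it is not a date."""
    parts = re.split(r"[-/.]", candidate)
    if len(parts) == 2:
        # No year in the text, so fall back to the caller's default.
        parts.append(str(default_year))
    try:
        return datetime.strptime("/".join(parts), "%m/%d/%Y")
    except ValueError:
        return None


def find_candidates(text, default_year):
    """Return (matched_text, datetime) for every candidate that parses."""
    found = []
    for match in CANDIDATE.finditer(text):
        parsed = loose_parse(match.group(), default_year)
        if parsed is not None:
            found.append((match.group(), parsed))
    return found


results = find_candidates("Th 9/19 LAB: Serial encoding (Section 2.2)", 2013)
for text, parsed in results:
    print(text, "->", parsed.date())
```

Besides 9/19, the sketch also reports "2.2" (from "Section 2.2") as 2 February, which is the kind of false positive a wide net produces.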

I am surprised that there is no mention of SUTime and dateparser's search_dates method.

from sutime import SUTime
import os
import json
from dateparser.search import search_dates

str1 = "Let's meet sometime next Thursday" 

# You'll get more information about these jar files from SUTime's github page
jar_files = os.path.join(os.path.dirname(__file__), 'jars')
sutime = SUTime(jars=jar_files, mark_time_ranges=True)

print(json.dumps(sutime.parse(str1), sort_keys=True, indent=4))
"""output: 
[
    {
        "end": 33,
        "start": 20,
        "text": "next Thursday",
        "type": "DATE",
        "value": "2018-10-11"
    }
]
"""

print(search_dates(str1))
#output:
#[('Thursday', datetime.datetime(2018, 9, 27, 0, 0))]

Although I have tried other modules such as dateutil, datefinder and natty (I couldn't get duckling to work with Python), these two seem to give the most promising results.

The results from SUTime are more reliable, as the code snippet above shows. However, SUTime fails in some basic scenarios, such as parsing the texts

"I won't be available until 9/19"

"I won't be available between (September 18-September 20)."

It gives no result for the first text and only the month and year for the second. The search_dates method handles both quite well, however. It is more aggressive and will return every possible date related to any word in the input text.

I haven't yet found a way to make search_dates parse the text strictly for dates. If I find one, it will be my first choice over SUTime, and I will update this answer.
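
Failing that, one possible workaround is to post-filter the results. A minimal sketch, assuming the results are a list of (matched text, datetime) tuples as in the output above. The rule itself (keep only matches that contain a digit) is my own crude heuristic, not something dateparser provides, and the second sample tuple is made up for illustration.

```python
import re
from datetime import datetime


def keep_explicit(results):
    """Keep (text, datetime) pairs whose matched text contains a digit.

    Accepts None as well, in case the search returned nothing.
    """
    return [(text, when) for text, when in (results or []) if re.search(r"\d", text)]


sample = [
    ("Thursday", datetime(2018, 9, 27, 0, 0)),  # from the output above
    ("9/19", datetime(2018, 9, 19, 0, 0)),      # hypothetical extra match
]
explicit = keep_explicit(sample)
print(explicit)
```

This drops purely relative matches such as "Thursday", so it only makes sense when, as in the question, every date of interest is written out explicitly.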

Asphodel answered 3/10, 2018 at 22:21 Comment(1)

sutime does seem to be the most powerful, especially after the Stanford CoreNLP team has had even more time to develop and improve it. However, I'm having some trouble and getting an Import Error when importing sutime. I posted more at length about it here, if you happen to have encountered a similar error: #59744152 – Botanist 15/1, 2020 at 2:19

You can use the dateutil module's parse method with the fuzzy option.

>>> from dateutil.parser import parse
>>> parse("Central design committee session Tuesday 10/22 6:30 pm", fuzzy=True)
datetime.datetime(2018, 10, 22, 18, 30)
>>> parse("There will be another one on December 15th for those who are unable to make it today.", fuzzy=True)
datetime.datetime(2018, 12, 15, 0, 0)
>>> parse("Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm", fuzzy=True)
datetime.datetime(2018, 3, 9, 23, 59)
>>> parse("He will be flying in Sept. 15th.", fuzzy=True)
datetime.datetime(2018, 9, 15, 0, 0)
>>> parse("Th 9/19 LAB: Serial encoding (Section 2.2)", fuzzy=True)
datetime.datetime(2002, 9, 19, 0, 0)

Wollastonite answered 13/7, 2018 at 11:25 Comment(1)

For dateutil.parser, a few caveats to mention: Feb172009 is read as Feb 14, so one should be careful with dates without spaces, which can be the case for Oracle DB's date format #E (MonDDYY). – Tapis 14/11, 2022 at 2:31

If you can identify the segments that actually contain the date information, parsing them can be fairly simple with parsedatetime. There are a few things to consider, though: your dates don't have years, and you should pick a locale.

>>> import parsedatetime
>>> p = parsedatetime.Calendar()
>>> p.parse("December 15th")
((2013, 12, 15, 0, 13, 30, 4, 319, 0), 1)
>>> p.parse("9/18 11:59 pm")
((2014, 9, 18, 23, 59, 0, 4, 319, 0), 3)
>>> # It chooses 2014 since that's the *next* occurrence of 9/18
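
The "next occurrence" choice can be made explicit with the standard library. A minimal sketch, assuming the reference date is the day this was written (15 November 2013); `resolve_year` is a hypothetical helper, not part of parsedatetime.

```python
from datetime import date


def resolve_year(month, day, reference):
    """Return the first date with this month/day on or after `reference`.

    Note: 29 February would need special handling in non-leap years.
    """
    candidate = date(reference.year, month, day)
    if candidate < reference:
        # Already passed this year, so move to next year.
        candidate = date(reference.year + 1, month, day)
    return candidate


today = date(2013, 11, 15)
print(resolve_year(9, 18, today))   # 2014-09-18
print(resolve_year(12, 15, today))  # 2013-12-15
```

With that reference date, 9/18 lands in 2014 and December 15th stays in 2013, the same years parsedatetime chose above.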

It doesn't always work perfectly when you have extraneous text.

>>> p.parse("9/19 LAB: Serial encoding")
((2014, 9, 19, 0, 15, 30, 4, 319, 0), 1)
>>> p.parse("9/19 LAB: Serial encoding (Section 2.2)")
((2014, 2, 2, 0, 15, 32, 4, 319, 0), 1)

Honestly, this seems like the kind of problem that would be simple enough to parse for particular formats and pick the most likely out of each sentence. Beyond that, it would be a decent machine learning problem.
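
To illustrate the "parse for particular formats" idea, here is a minimal sketch that hard-codes a few patterns (optional weekday, then M/D or month name plus day, then an optional time) and splits a line into the date and everything else. The patterns and the `split_date` name are my own assumptions and only cover the shapes shown in the question.

```python
import re

WEEKDAY = r"(?:Mon|Tues?|Wed(?:nes)?|Th(?:u(?:rs?)?)?|Fri|Sat(?:ur)?|Sun)(?:day)?\.?"
NUMERIC = r"\d{1,2}/\d{1,2}(?:/\d{2,4})?"
MONTH_DAY = (
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept?|Oct|Nov|Dec)[a-z]*\.?"
    r"\s+\d{1,2}(?:st|nd|rd|th)?"
)
TIME = r"\d{1,2}:\d{2}\s*(?:am|pm)"

# Plain concatenation (not an f-string) so the {1,2} quantifiers stay intact.
DATE_RE = re.compile(
    r"\b(?:" + WEEKDAY + r"\s+)?(?:" + NUMERIC + "|" + MONTH_DAY + r")(?:\s+" + TIME + ")?",
    re.IGNORECASE,
)


def split_date(text):
    """Return [date_text, remaining_text], or None if no pattern matches."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    rest = text[:match.start()] + " " + text[match.end():]
    return [match.group(), " ".join(rest.split())]  # collapse leftover whitespace


print(split_date("Th 9/19 LAB: Serial encoding (Section 2.2)"))
# ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']
```

Only the first match per line is returned; picking "the most likely" among several matches would need an extra scoring step on top of this.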

Euphemie answered 15/11, 2013 at 6:16 Comment(1)

I guess a better question for me to ask is: what's the best way to automatically identify the segments? Is there some sort of method (besides giant regexes, I guess) to identify the date substring? – Oldest 15/11, 2013 at 6:28

Newer versions of dateparser lib provide search functionality.

Example

from dateparser.search import search_dates

dates = search_dates('Central design committee session Tuesday 10/22 6:30 pm')

Bifrost answered 10/9, 2019 at 21:23 Comment(2)

The lib is actually dateparser. I fed it a simple date and it took 5 seconds to process, but on a string with an embedded date it did find the date where parser.parse(t) did not. So I use parse(t, fuzzy=True) first and then search_dates if it fails. – Spot 23/1, 2022 at 20:47

Thanks, dateparser is the best lib for me. – Recaption 30/4, 2023 at 2:43
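
The fallback order from Spot's comment above (cheap parser first, slower search only if that fails) can be sketched generically. Two standard-library stand-ins take the place of `parse(t, fuzzy=True)` and `search_dates` here, and the sketch assumes a failing parser either raises `ValueError` or returns `None`. The names `first_successful`, `strict_iso` and `search_iso` are made up for the example.

```python
import re
from datetime import datetime


def first_successful(text, parsers):
    """Try each parser in order and return the first non-None result."""
    for parser in parsers:
        try:
            result = parser(text)
        except ValueError:
            continue  # this parser could not handle the text; try the next
        if result is not None:
            return result
    return None


def strict_iso(text):
    # Stand-in for the fast parser: only accepts a bare ISO date string.
    return datetime.fromisoformat(text)


def search_iso(text):
    # Stand-in for the slower search: looks for an ISO date inside the text.
    match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return datetime.fromisoformat(match.group()) if match else None


chain = [strict_iso, search_iso]
print(first_successful("monkey 2010-07-10 love banana", chain))
```

Swapping in the real library calls should only require replacing the two stand-in functions, provided they fail in one of the two assumed ways.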

Hi, I'm not sure the approach below counts as machine learning, but you may try it:

Add some context from outside the text, e.g. the publishing time of the text message or post, the current time, etc. (your text doesn't say anything about the year).
Extract all tokens, using whitespace as the separator. You should get something like this:
```
['Th','Wednesday','9:34pm','7:34','pm','am','9/18','9/','/18', '19','12']
```
Process them with rule sets, e.g. ones consisting of weekday names and/or variations of the components that form a time, and mark the tokens that match: '%d:%dpm', '%d am', '%d/%d', '%d/ %d' etc. may denote a date or time. Note that a date may be split across tokens, e.g. "12 / 31" is a 3-gram ('12','/','31') that should become the single token of interest "12/31".
"See" which tokens surround the marked tokens such as "9:45pm", e.g. ('Th','9/19','9:45pm') is a 3-gram formed from "interesting" tokens, and apply rules to it that may determine its meaning.
Process further for more specific analysis. For example, with 31/12, 31 > 12 means d/m (or vice versa), but with 12/12 the month and day can only be determined from context built from the text and/or from outside.
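
A rough sketch of the marking and day/month steps above, using only the standard library. The rule set, the labels and the helper names (`mark`, `day_month_order`) are made up for illustration and cover only a few token shapes.

```python
import re

WEEKDAYS = {
    "mon", "tue", "wed", "th", "thu", "fri", "sat", "sun",
    "monday", "tuesday", "wednesday", "thursday", "friday",
    "saturday", "sunday",
}
TIME_RE = re.compile(r"\d{1,2}:\d{2}(?:am|pm)?", re.IGNORECASE)
SLASH_RE = re.compile(r"(\d{1,2})/(\d{1,2})")


def mark(token):
    """Label a single whitespace-separated token, or return None."""
    if token.lower().rstrip(".") in WEEKDAYS:
        return "weekday"
    if TIME_RE.fullmatch(token):
        return "time"
    if SLASH_RE.fullmatch(token):
        return "slash_date"
    return None


def day_month_order(token):
    """Guess the field order of an 'a/b' token from the values alone."""
    first, second = map(int, SLASH_RE.fullmatch(token).groups())
    if first > 12 and second <= 12:
        return "d/m"
    if second > 12 and first <= 12:
        return "m/d"
    return "ambiguous"  # e.g. 12/12: needs context from outside the token


tokens = "Th 9/19 LAB: Serial encoding (Section 2.2)".split()
labels = [(tok, mark(tok)) for tok in tokens]
marked = [pair for pair in labels if pair[1] is not None]
print(marked)                    # [('Th', 'weekday'), ('9/19', 'slash_date')]
print(day_month_order("31/12"))  # d/m
```

Neighbouring marked tokens (here 'Th' followed by '9/19') are what the n-gram step would then combine into a single date expression.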

Cheers

Sotelo answered 15/11, 2013 at 8:48 Comment(0)

Here is a simple Python function that uses a regex to extract dates from input text

import re

def extract_date(text):
    # Numeric dates with -, / or . separators, e.g. 2010-07-10 or 10/07/2010.
    # \d{1,4} for the first field covers both year-first and day/month-first forms.
    date_pattern = re.compile(r'\b\d{1,4}[-/.]\d{1,2}[-/.]\d{2,4}\b')
    # findall returns the matched substrings in order of appearance
    return date_pattern.findall(text)

text = "monkey 2010-07-10 love banana"
extracted_dates = extract_date(text)

if extracted_dates:
    print("Extracted Dates:", extracted_dates)
else:
    print("No dates found.")

Output: Extracted Dates: ['2010-07-10']
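
If you also need to know where each date sits in the text (for example to separate it from the surrounding words), `re.finditer` gives the match positions. A small sketch along the same lines; `extract_dates_with_spans` is just an illustrative name.

```python
import re

DATE_PATTERN = re.compile(r"\b\d{1,4}[-/.]\d{1,2}[-/.]\d{2,4}\b")


def extract_dates_with_spans(text):
    """Return (matched_text, start, end) for every numeric date found."""
    return [(m.group(), m.start(), m.end()) for m in DATE_PATTERN.finditer(text)]


spans = extract_dates_with_spans("monkey 2010-07-10 love banana")
print(spans)  # [('2010-07-10', 7, 17)]
```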

Mullion answered 25/10, 2023 at 14:40 Comment(0)

-1

There is no perfect solution. It depends entirely on the type of data you are working with. Quickly review and analyze the data by going through a sample of it manually, then prepare a regex pattern and test whether it works.

The existing packages solve the date extraction problem only to a limited extent. If you can work out the approximate pattern by looking at the data, you can prepare your own regex, which also avoids iterating over all the rules written into those packages.
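
As a sketch of that "prepare a pattern and test it" loop, the helper below (a hypothetical `check_pattern`) reports which sample strings a candidate regex fails to match, using three of the strings from the question.

```python
import re


def check_pattern(pattern, samples):
    """Return the samples in which the pattern finds nothing."""
    compiled = re.compile(pattern)
    return [s for s in samples if compiled.search(s) is None]


samples = [
    "Central design committee session Tuesday 10/22 6:30 pm",
    "Th 9/19 LAB: Serial encoding (Section 2.2)",
    "He will be flying in Sept. 15th.",
]
missed = check_pattern(r"\b\d{1,2}/\d{1,2}\b", samples)
print(missed)  # ['He will be flying in Sept. 15th.']
```

Each missed sample points at a format the pattern still needs to cover, here a month name followed by a day.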

Darladarlan answered 7/5, 2021 at 10:50 Comment(0)
