customize dateutil.parser century inference logic

I am working on old text files with 2-digit years, where the default century logic in dateutil.parser doesn't work well. For example, the attack on Pearl Harbor did not happen on the date that dparser.parse("12/7/41") returns (2041-12-07).

The built-in century "threshold" for rolling back into the 1900s seems to fall at 66:

import dateutil.parser as dparser
print(dparser.parse("12/31/65")) # goes forward to 2065-12-31 00:00:00
print(dparser.parse("1/1/66")) # goes back to 1966-01-01 00:00:00

For my purposes I would like to set this "threshold" at 17, so that:

  • "12/31/16" parses to 2016-12-31 (yyyy-mm-dd)
  • "1/1/17" parses to 1917-01-01

But I would like to continue to use this module as its fuzzy match seems to be working well.

The documentation doesn't identify a parameter for doing this... is there an argument I'm overlooking?

Belloir answered 25/7, 2016 at 20:41 Comment(0)

This isn't particularly well documented, but you can override this behavior through dateutil.parser. The second argument to parse is a parserinfo object, and the method you'll be concerned with is convertyear. The default implementation is what's causing you problems: it bases its interpretation of the century on the current year, plus or minus fifty years. That's why, in 2016, you're seeing the transition at 1966. Next year it will be 1967. :)
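
The sliding window described above can be sketched in plain Python. This is a rough standalone illustration, not dateutil's source: the function name infer_century and its reference_year parameter are made up for the example, and the exact handling of the 50-year boundary is an assumption chosen to match the outputs shown in the question.

```python
def infer_century(two_digit_year, reference_year):
    """Map a 2-digit year into a 100-year window around reference_year.

    Hypothetical helper for illustration only. The window is assumed to be
    [reference_year - 50, reference_year + 49].
    """
    # Start by assuming the same century as the reference year.
    year = reference_year // 100 * 100 + two_digit_year
    if year >= reference_year + 50:
        # Too far in the future: move back one century.
        year -= 100
    elif year < reference_year - 50:
        # Too far in the past: move forward one century.
        year += 100
    return year


print(infer_century(65, 2016))  # 2065
print(infer_century(66, 2016))  # 1966
print(infer_century(66, 2017))  # 2066
print(infer_century(67, 2017))  # 1967
```

Because the window is anchored to the reference year, the cut-off moves forward by one every year, which is why a fixed rule needs an override.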

Since you are using this personally and may have very specific needs, you don't have to be super-generic. You could do something as simple as this if it works for you:

from dateutil.parser import parse, parserinfo

class MyParserInfo(parserinfo):
    def convertyear(self, year, *args, **kwargs):
        if year < 100:
            year += 1900
        return year

parse('1/21/47', MyParserInfo())
# datetime.datetime(1947, 1, 21, 0, 0)
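
For the cut-off the question asks for (16 and below in the 2000s, 17 and above in the 1900s), the same override can take a pivot. This is a minimal sketch, assuming a dateutil version that passes century_specified as the second positional argument (see the comments below); the class name PivotParserInfo and its PIVOT attribute are hypothetical names used only for illustration:

```python
from dateutil.parser import parse, parserinfo


class PivotParserInfo(parserinfo):
    # Hypothetical setting: 2-digit years below PIVOT become 20xx,
    # years from PIVOT upward become 19xx.
    PIVOT = 17

    def convertyear(self, year, century_specified=False):
        # Only touch years that were really written with two digits.
        if year < 100 and not century_specified:
            year += 1900 if year >= self.PIVOT else 2000
        return year


info = PivotParserInfo()
print(parse("12/31/16", info))  # 2016-12-31 00:00:00
print(parse("1/1/17", info))    # 1917-01-01 00:00:00

# Fuzzy parsing still works; a 4-digit year is not below 100,
# so it bypasses the pivot.
print(parse("The Soviets tested their first A-bomb on 8/29/49", info, fuzzy=True).year)
print(parse("Scientists promise flying atomic cars by the year 2020", info, fuzzy=True).year)
```

Keeping the rule inside convertyear, rather than adjusting the result afterwards, means explicit 4-digit years are left alone.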
Damson answered 25/7, 2016 at 20:57 Comment(6)
See this bug report. The recommended course of action is to subclass and override convertyear. (Blackshear)
Hey, cool. Great minds think alike! (I wasn't aware of that report.) (Damson)
It's probably worth noting that in the forthcoming 2.6.0, a century_specified flag will also be passed to convertyear, to differentiate between 0099-04-20 and 99-04-20. This implementation with **kwargs should cover that. (Blackshear)
I am getting a "TypeError: convertyear() takes 2 positional arguments but 3 were given" on the above solution. However, I am only on 2.5.1 of dateutil, so let me see if updating makes a difference. (Belloir)
@Belloir Looks like the change was actually in 2.5.0. Replace **kwargs with *args or *args, **kwargs and you should be good. (Blackshear)
Also, note that if it's possible that you'll get years like 0095 or 057 (in the first century AD), you should explicitly change the signature to year, century_specified=False and process accordingly. (Blackshear)

You can also post-process the parsed dates, manually moving a date back one century if its year is greater than a specified threshold (2016 in your case):

import dateutil.parser as dparser

THRESHOLD = 2016

date_strings = ["12/31/65", "1/1/66", "12/31/16", "1/1/17"]
for date_string in date_strings:
    dt = dparser.parse(date_string)
    if dt.year > THRESHOLD:
        dt = dt.replace(year=dt.year - 100)
    print(dt)

Prints:

1965-12-31 00:00:00
1966-01-01 00:00:00
2016-12-31 00:00:00
1917-01-01 00:00:00
Boarish answered 25/7, 2016 at 20:50 Comment(4)
Thanks. For my use case, which has mixed types, I can't quite afford to dock every date over the threshold, because sometimes the century is explicit. Consider: print(dparser.parse("The Soviets tested their first A-bomb on 8/29/49", fuzzy=True)); print(dparser.parse("Scientists promise flying atomic cars by the year 2020", fuzzy=True)) (Belloir)
It seems to be configurable but in an obscure way. The documentation is hardly clear. I had to poke around in the source code to figure out a way (which, in fairness, is linked from the docs). (Damson)
@Two-BitAlchemist a nice find, indeed! Thanks! (Boarish)
@Belloir good examples that break this solution apart. Thanks. (Boarish)

Other than writing your own parserinfo.convertyear method, you can customize this by passing a standard parserinfo object with changed _century and _year settings *):

from dateutil.parser import parse, parserinfo

info = parserinfo()
info._century = 1900  # undocumented attribute, see note *) below
info._year = 1965     # undocumented attribute
print(parse('12/31/65', parserinfo=info))
# => 1965-12-31 00:00:00

_century specifies the century base that is added to a parsed two-digit year, i.e. 65 + 1900 = 1965.

_year specifies the center of the cut-off window (+- 50 years). Any year that ends up at least 50 years away from _year is moved by one century:

  • a year below _year is switched to the next century
  • a year above _year is switched to the previous century

Think of this as a timeline:

1900          1916          1965          2015
+--- (...) ---+--- (...) ---+--- (...) ---+
^             ^             ^             ^
_century      _year - 49    _year         _year + 50

parsed years:
              16,17,...             99,00,...15

In other words, the years 00, 01, ..., 99 are mapped to the time range _year - 49 .. _year + 50 with _year set to the middle of this 100-year period. Using these two settings you can thus specify any cut off you like.

*) Note that these two variables are undocumented; however, they are used in the default implementation of parserinfo.convertyear in the newest stable version at the time of writing, 2.5.3. IMHO the default implementation is quite smart.

Trifling answered 25/7, 2016 at 21:42 Comment(2)
I would not recommend relying on private variables, as they are not guaranteed to exist in later versions. In the case of these variables, I imagine they will soon be removed in favor of a public interface. (Blackshear)
I see your point; however, you can always implement your own parserinfo.convertyear and thus preserve the behavior should dateutil choose to change theirs. (Trifling)
