Datetime conversion - How to extract the inferred format?
Asked Answered
L

3

30

Here's an array of datetime values:

array = np.array(['2016-05-01T00:00:59.3+10:00', '2016-05-01T00:02:59.4+10:00',
                  '2016-05-01T00:03:59.4+10:00', '2016-05-01T00:13:00.1+10:00',
                  '2016-05-01T00:22:00.5+10:00', '2016-05-01T00:31:01.1+10:00'],
        dtype=object)

pd.to_datetime is very good at inferring datetime formats.

array = pd.to_datetime(array)

print(array)
DatetimeIndex(['2016-04-30 14:00:59.300000', '2016-04-30 14:02:59.400000',
               '2016-04-30 14:03:59.400000', '2016-04-30 14:13:00.100000',
               '2016-04-30 14:22:00.500000', '2016-04-30 14:31:01.100000'],
              dtype='datetime64[ns]', freq=None)

How can I dynamically figure out what datetime format pd.to_datetime inferred? Something like: %Y-%m-%dT... (sorry, my datetime foo is really bad).

Laynelayney answered 20/10, 2017 at 5:20 Comment(8)
Ohk you want to get the format in the form of string after converting into dateteime?Terrie
@Bharathshetty Correct, I'd like to know what the format was, that pd.to_datetime inferred.Laynelayney
@coldspeed theres a bit of mechanism they use _guess_datetime_format method beforing infering the format in main library but when I do the same Im not getting the proper output. Let me post solution as soon as I get that rightTerrie
"welcome to hell" was early pandas advice from Wes. Having dug into the datetime parsing a long time ago it's... kinda wild.Mt
@cᴏʟᴅsᴘᴇᴇᴅ is it possible to post the complete expected output?Terrie
@Bharath I’ll tell you the format I’m looking for includes the format strings for the hours, minutes, seconds, milliseconds, the time zone and so on. I’ll try and figure out exactly the format, but I think this is the ISO format.Laynelayney
There is a date inferring library available for this: github.com/wdm0006/dateinferCauserie
@denfromufa If you're worked with it, you could post an answer!Laynelayney
H
36

I don't think it's possible to do this in full generality in pandas.

As mentioned in other comments and answers, the internal function _guess_datetime_format is close to being what you ask for, but it has strict criteria for what constitutes a guessable format and so it will only work for a restricted class of datetime strings.

These criteria are set out in the _guess_datetime_format function on these lines and you can also see some examples of good and bad formats in the test_parsing script.

Some of the main points are:

  • year, month and day must each be present and identifiable
  • the year must have four digits
  • exactly six digits must be used if using microseconds
  • you can't specify a timezone

This means that it will fail to guess the format for datetime strings in the question despite them being a valid ISO 8601 format:

>>> from pandas.core.tools.datetimes import _guess_datetime_format_for_array
>>> array = np.array(['2016-05-01T00:00:59.3+10:00'])
>>> _guess_datetime_format_for_array(array)
# returns None

In this case, dropping the timezone and padding the microseconds to six digits is enough to make pandas to recognise the format:

>>> array = np.array(['2016-05-01T00:00:59.300000']) # six digits, no tz
>>> _guess_datetime_format_for_array(array)
'%Y-%m-%dT%H:%M:%S.%f'

This is probably as good as it gets.

If pd.to_datetime is not asked to infer the format of the array, or given a format string to try, it will just try and parse each string separately and hope that it is successful. Crucially, it does not need to infer a format in advance to do this.

First, pandas parses the string assuming it is (approximately) a ISO 8601 format. This begins in a call to _string_to_dts and ultimately hits the low-level parse_iso_8601_datetime function that does the hard work.

You can check if your string is able to be parsed in this way using the _test_parse_iso8601 function. For example:

from pandas._libs.tslib import _test_parse_iso8601

def is_iso8601(string):
    try:
        _test_parse_iso8601(string)
        return True
    except ValueError:
        return False

The dates in the array you give are recognised as this format:

>>> is_iso8601('2016-05-01T00:00:59.3+10:00')
True

But this doesn't deliver what the question asks for and I don't see any realistic way to recover the exact format that is recognised by the parse_iso_8601_datetime function.

If parsing the string as a ISO 8601 format fails, pandas falls back to using the parse() function from the third-party dateutil library (called by parse_datetime_string). This allows a fantastic level of parsing flexibility but, again, I don't know of any good way to extract the recognised datetime format from this function.

If both of these two parsers fail, pandas either raises an error, ignores the string or defaults to NaT (depending on what the user specifies). No further attempt is made to parse the string or guess the format of the string.

Hahnke answered 31/10, 2017 at 22:35 Comment(2)
This is a very well written answer, thank you. I now see the inherent limitations of these libraries. Things start to make sense once you understand that parsing the date and inferring the format string are two separate things.Laynelayney
Glad it was helpful. It may be that pandas' functionality is expanded one day to include more formats as inference can often provide a significant performance boost.Hahnke
C
12

DateInfer (PyDateInfer) library allows to infer dates based on the sequence of available dates:

github.com/wdm0006/dateinfer

Usage from docs:

>>> import dateinfer
>>> dateinfer.infer(['Mon Jan 13 09:52:52 MST 2014', 'Tue Jan 21 15:30:00 EST 2014'])
'%a %b %d %H:%M:%S %Z %Y'
>>>

Disclaimer: I have used and then contributed to this library

Causerie answered 2/11, 2017 at 19:30 Comment(0)
T
6

You can use _guess_datetime_format from core.tools to get the format. ie

from pandas.core.tools import datetimes as tools
tools._guess_datetime_format(pd.to_datetime(array).format()[0][:10])

Output :

'%Y-%m-%d'

To know more about this method you can see here. Hope it helps.

Terrie answered 20/10, 2017 at 6:6 Comment(12)
Actually, in this case, it doesn't get there for that you need infer_datetime_format and format is None. In OP's case it goes to tslib.array_to_datetime I guess.Anthropomorphous
Absolutely I was wondering why I'm getting None while doing so. Maybe Im stubborn to move forwardTerrie
I agree the answer is hiding in array_to_datetime. Give me a bit of timeTerrie
@cᴏʟᴅsᴘᴇᴇᴅ Good you posted the question now Im stuck. Maybe we need to write our method.Terrie
It's weird as I seem to just get None from this method github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/tools/…, you'd think it would return a guess from array[0].Mt
I thought so.It didn't work so I just passed only the date using [:10] not the time. As Zero said its being sent to tslib.array_to_datetime. and Array_to_datetime felt like chinese. I got stuckTerrie
Thanks for all your help guys.Laynelayney
@cᴏʟᴅsᴘᴇᴇᴅ wait for day or two before accepting the answer in case you want to accept. Someone might come up with the best oneTerrie
yeah, I'll hold back, since the crux of my answer (getting the format, not just the y/m/d one) was not addressed. Thanks again Bharath.Laynelayney
I think the rest of the format is same as always HH:MM:SS right? Format issue is only with month, day and year I guess. I dont know much other than thatTerrie
@Bharathshetty (That's c#) and I was looking for a pandas util that could do this for me. ;-) Similar to pandas' CSV sniffer to infer CSV delimiters.Laynelayney
I deleted my comment realizing that they use so many different libraries :)Terrie

© 2022 - 2024 — McMap. All rights reserved.