pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset

I am confused about how pandas blew out of bounds for datetime objects with these lines:

import pandas as pd
BOMoffset = pd.tseries.offsets.MonthBegin()
# here some code sets the all_treatments dataframe and the newrowix, micolix, mocolix counters
all_treatments.iloc[newrowix,micolix] = BOMoffset.rollforward(all_treatments.iloc[i,micolix] + pd.tseries.offsets.DateOffset(months = x))
all_treatments.iloc[newrowix,mocolix] = BOMoffset.rollforward(all_treatments.iloc[newrowix,micolix]+ pd.tseries.offsets.DateOffset(months = 1))

Here all_treatments.iloc[i,micolix] is a datetime set by pd.to_datetime(all_treatments['INDATUMA'], errors='coerce',format='%Y%m%d'), and INDATUMA is date information in the format 20070125.

This logic seems to work on mock data (no errors, dates make sense), so at the moment I cannot reproduce the problem, but it fails on my full data with the following error:

pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00
Paramour answered 1/10, 2015 at 12:54 Comment(2)
create a minimal but complete code example – Emaciation
I managed to fix it by: – Whitt

Since pandas represents timestamps in nanosecond resolution, the time span that can be represented using a 64-bit integer is limited to approximately 584 years:

In [54]: pd.Timestamp.min
Out[54]: Timestamp('1677-09-22 00:12:43.145225')

In [55]: pd.Timestamp.max
Out[55]: Timestamp('2262-04-11 23:47:16.854775807')

And your value, 2262-05-01 00:00:00, is outside this range, hence the out-of-bounds error.

Straight out of: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations

Workaround:

This will force dates that are outside the bounds to NaT:

pd.to_datetime(date_col_to_force, errors = 'coerce')
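For example, a minimal sketch of this coercion (the series and its values are illustrative):

import pandas as pd

# one in-range date and one beyond pd.Timestamp.max (2262-04-11)
date_col_to_force = pd.Series(['2015-10-01', '2262-05-01'])

# errors='coerce' turns the out-of-bounds value into NaT instead of raising
print(pd.to_datetime(date_col_to_force, errors='coerce'))
# 0   2015-10-01
# 1          NaT
# dtype: datetime64[ns]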

Salta answered 14/5, 2016 at 12:29 Comment(8)
Do you know if it's possible to have a date index that goes beyond that range? – Waller
@Waller see this... – Cleo
I knew about that, thanks. The problem is that you can have the index made of the period range that goes beyond the timestamp limits, and it's fine until you actually try to use the individual values from that index - then you will either get errors or incorrect dates. – Waller
@jetpackdata How can we fix this when converting to datetime? It seems like a catch-22: how do you check an input before converting it? There should be a way to NaT anything out of range. – Willwilla
Is there a workaround for dates that extend beyond the max limit? I have an np.where that uses another column if ['date'] > today. Someone fat-fingered 3012 instead of 2012 and it errors out the equation. – Millesimal
@ArthurD.Howland This must help: pandas-docs.github.io/pandas-docs-travis/user_guide/… – Chessy
Link has changed: pandas-docs.github.io/pandas-docs-travis/user_guide/… – Bogie
But all my dates are before 1600; should I set them all to NaT? This is not a good solution in my case. – Scummy

Setting the errors parameter in pd.to_datetime to 'coerce' causes out-of-bounds values to be replaced with NaT. Quoting the docs:

If ‘coerce’, then invalid parsing will be set as NaT

E.g.:

datetime_variable = pd.to_datetime(datetime_variable, errors = 'coerce')

This does not fix the data (obviously), but still allows processing the non-NaT data points.

Bename answered 9/3, 2018 at 10:33 Comment(3)
This is not recommended if we actually want to consider the value rather than blindly converting it to NaT. – Chitwood
@Chitwood And where have I claimed otherwise? – Bename
Hi @pawel, I didn't mean you claimed otherwise; I just meant that if we specify errors='ignore' rather than 'coerce', we can keep track of / retain the original error value for reference. – Chitwood

The reason you are seeing the error message "OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-12-23 00:00:00" is that the pandas timestamp data type stores dates in nanosecond resolution (from the docs).

This means the date values have to be in the range between

pd.Timestamp.min (1677-09-21 00:12:43.145225) and

pd.Timestamp.max (2262-04-11 23:47:16.854775807)

Even if you only want dates with a resolution of seconds or microseconds, pandas will still store them internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above-mentioned range.

This is surprising, because databases like SQL Server and libraries like numpy allow storing dates beyond this range, and a maximum of 64 bits is used in most cases to store the date.

But here is the difference. SQL Server stores dates in nanosecond resolution, but only up to an accuracy of 100 ns (as opposed to 1 ns in pandas). Since the space is limited (64 bits), it's a matter of range vs. accuracy. With the pandas timestamp we have higher accuracy but a smaller date range.

In the case of numpy's datetime64 data type (pandas is built on top of numpy):

  • if the date falls in the above-mentioned range, you can store it in nanoseconds, similar to pandas;
  • OR you can give up the nanosecond resolution and go with microseconds, which gives you a much larger range (a sketch follows the wrap-around example below). This is something that is missing in the pandas timestamp type.

However, if you choose to store in nanoseconds and the date is outside the range, then numpy will silently wrap this date around and you might get unexpected results (referenced below in the 4th solution).

import numpy as np

np.datetime64("3000-06-19T08:17:14.073456178", "ns")
> numpy.datetime64('1831-05-11T09:08:06.654352946')
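For contrast, a minimal sketch of the microsecond alternative from the list above (not part of the original example): giving up nanosecond resolution makes the same date representable without wrap-around.

import numpy as np

# microsecond resolution covers roughly +/- 292,000 years,
# so the year-3000 date fits in the 64-bit counter
print(np.datetime64("3000-06-19T08:17:14.073456", "us"))
# 3000-06-19T08:17:14.073456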

Now with pandas, we have the options below:

import pandas as pd
data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)

1) If you are OK with losing the data that is out of range, then simply use the param below to convert out-of-range dates to NaT (not a time).

my_df['dob'] = pd.to_datetime(my_df['dob'], errors = 'coerce')


2) If you don't want to lose the data, then you can convert the values into Python datetime objects. Here the column "dob" is of pandas type object, but the individual values are of type Python datetime. However, doing this we lose the benefit of vectorized functions.

import datetime as dt
my_df['dob'] = my_df['dob'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S') if type(x) == str else pd.NaT)
print(type(my_df.iloc[0, 1]))
> <class 'datetime.datetime'>


3) Another option is to use numpy instead of a pandas Series, if possible. In the case of a pandas dataframe, you can convert a series (or column in a df) to a numpy array, process the data separately, and then join it back to the dataframe.
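A minimal sketch of this approach, rebuilding my_df from above; the 30-day shift is just an illustrative piece of numpy-side processing:

import numpy as np
import pandas as pd

data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)

# datetime64[us] trades nanosecond precision for a much larger range,
# so the year-3000 date is representable
dob_array = np.array(my_df['dob'], dtype='datetime64[us]')

# process in numpy, e.g. shift every date by 30 days
print(dob_array + np.timedelta64(30, 'D'))
# ['3000-07-19T08:17:14.000000' '2000-07-19T21:17:14.000000']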

4) We can also use pandas timespans, as suggested in the docs. Do check out the difference between timestamp and period before using this data type. Date range and frequency here work similarly to numpy (mentioned above in the numpy section).

my_df['dob'] = my_df['dob'].apply(lambda x: pd.Period(x, freq='ms'))
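Comparisons and arithmetic still work on the resulting Period values even far outside the Timestamp range; a quick sketch, using second resolution for readability:

import pandas as pd

p = pd.Period('3000-06-19T08:17:14', freq='s')
print(p.year)  # 3000
print(p + 60)  # 3000-06-19 08:18:14 (60 one-second periods later)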


Collegian answered 9/10, 2021 at 13:35 Comment(0)

You can try strptime() from the datetime library together with a lambda expression to convert the text to date values in a Series object:

Example:

import datetime
import numpy as np

df['F'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S') if type(x) == str else np.nan)
Boogie answered 31/8, 2020 at 13:19 Comment(1)
This is perfect - exactly what I needed to solve the nanoseconds out-of-bounds problem without losing the underlying data. – Methadone

None of the above are great if it means deleting your data. Instead, you can keep the data and adjust your conversion:

# converting from epoch to datetime, maintaining the nanosecond timestamps
xbarout = pd.to_datetime(xbarout.iloc[:, 0], unit='ns')
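For instance, a small self-contained sketch of the same conversion (the epoch value is illustrative):

import pandas as pd

# nanoseconds since the epoch, 1970-01-01
epochs = pd.Series([1443657600000000000])
print(pd.to_datetime(epochs, unit='ns'))
# 0   2015-10-01
# dtype: datetime64[ns]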
Torticollis answered 18/4, 2020 at 23:22 Comment(0)

For anyone else who arrived here after using pd.read_sql_query() with the dtype param on datetime values and getting this OutOfBoundsDatetime error: you can fix the issue by using the parse_dates parameter to set errors='coerce' for the date column (the dtype entry for that column is then unnecessary). E.g.:

df = pd.read_sql_query(sql, conn, parse_dates={'date_col': {'errors': 'coerce'}})
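A self-contained sketch of the same fix using sqlite3 (the table and column names are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (date_col TEXT)")
conn.execute("INSERT INTO t VALUES ('2015-10-01'), ('3012-01-01')")

# each dict value in parse_dates is passed as keyword arguments to
# pd.to_datetime, so the year-3012 typo becomes NaT instead of raising
df = pd.read_sql_query('SELECT * FROM t', conn, parse_dates={'date_col': {'errors': 'coerce'}})
print(df)
#     date_col
# 0 2015-10-01
# 1        NaT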
Sackett answered 2/5, 2023 at 14:51 Comment(0)

Make sure you are using a current version of pandas. I was having this issue while trying to upload a table into BigQuery (Cloud Shell installs pandas 1.5.* by default) while reading a parquet file with pandas using engine='pyarrow':

df = pd.read_parquet(path, engine='pyarrow')

My solution was to upgrade the pandas package to 2.0.3 and change the engine to 'fastparquet' (which gave me a different error with the older pandas version); pandas 2.x added support for non-nanosecond datetime64 resolutions, which is likely why the upgrade avoids the out-of-bounds timestamps:

df = pd.read_parquet(path, engine='fastparquet')
Evolutionary answered 25/7, 2023 at 21:39 Comment(0)
