Unexpected results of min() and max() methods of Pandas series made of Timestamp objects
Asked Answered
H

1

6

I encountered this behaviour when doing basic data munging, like in this example:

In [55]: import pandas as pd
In [56]: import numpy as np
In [57]: rng = pd.date_range('1/1/2000', periods=10, freq='4h')
In [58]: lvls = ['A','A','A','B','B','B','C','C','C','C']
In [59]: df = pd.DataFrame({'TS': rng, 'V' : np.random.randn(len(rng)), 'L' : lvls})

In [60]: df
Out[60]: 
   L                  TS         V
0  A 2000-01-01 00:00:00 -1.152371
1  A 2000-01-01 04:00:00 -2.035737
2  A 2000-01-01 08:00:00 -0.493008
3  B 2000-01-01 12:00:00 -0.279055
4  B 2000-01-01 16:00:00 -0.132386
5  B 2000-01-01 20:00:00  0.584091
6  C 2000-01-02 00:00:00 -0.297270
7  C 2000-01-02 04:00:00 -0.949525
8  C 2000-01-02 08:00:00  0.517305
9  C 2000-01-02 12:00:00 -1.142195

the problem:

In [61]: df['TS'].min()
Out[61]: 31969-04-01 00:00:00

In [62]: df['TS'].max()
Out[62]: 31973-05-10 00:00:00

while this looks ok:

In [63]: df['V'].max()
Out[63]: 0.58409076701429163

In [64]: min(df['TS'])
Out[64]: <Timestamp: 2000-01-01 00:00:00>

when aggregating after groupby:

In [65]: df.groupby('L').min()
Out[65]: 
             TS         V
L                        
A  9.466848e+17 -2.035737
B  9.467280e+17 -0.279055
C  9.467712e+17 -1.142195

In [81]: val = df.groupby('L').agg('min')['TS']['A']
In [82]: type(val)
Out[82]: numpy.float64

Apparently in this particular case it has something to do with using frequency datetime index as argument of pd.Series function:

In [76]: rng.min()
Out[76]: <Timestamp: 2000-01-01 00:00:00>

In [77]: ts = pd.Series(rng)
In [78]: ts.min()
Out[78]: 31969-04-01 00:00:00

In [79]: type(ts.min())
Out[79]: numpy.datetime64

However, my initial problem was with min/max of Timestamp series parsed from strings via pd.read_csv()

What am I doing wrong?

Hickerson answered 19/10, 2012 at 12:34 Comment(3)
What version of Numpy are you using? There are problems with the representation of timestamps in 1.6.Dupree
It is all under Numpy 1.6.2 and Pandas 0.9.0, maintained via MacPorts. Following your hint, I have found similar problem reports, just trying to figure out how they relate to my situation.Hickerson
BTW, Meteore, were you (or anyone else) able to replicate this behaviour?Hickerson
I
6

As @meteore points out, it's a problem with the string repr of the np.datetime64 type in NumPy 1.6.x. The underlying data, should still be correct. To workaround this problem, you can do something like:

In [15]: df
Out[15]: 
   L                  TS         V
0  A 2000-01-01 00:00:00  0.752035
1  A 2000-01-01 04:00:00 -1.047444
2  A 2000-01-01 08:00:00  1.177557
3  B 2000-01-01 12:00:00  0.394590
4  B 2000-01-01 16:00:00  1.835067
5  B 2000-01-01 20:00:00 -0.768274
6  C 2000-01-02 00:00:00 -0.564037
7  C 2000-01-02 04:00:00 -2.644367
8  C 2000-01-02 08:00:00 -0.571187
9  C 2000-01-02 12:00:00  1.618557

In [16]: df.TS.astype(object).min()
Out[16]: datetime.datetime(2000, 1, 1, 0, 0)

In [17]: df.TS.astype(object).max()
Out[17]: datetime.datetime(2000, 1, 2, 12, 0)
Immoralist answered 19/10, 2012 at 14:39 Comment(5)
True, but the result of min/max should be boxed as Timestamps. I'll open an issue: github.com/pydata/pandas/issues/2083Isosceles
Thanks! So I should just bypass Timestamp representation until the issue is resolved? Only in this particular case or can this kind of nuisance be generalized?Hickerson
Timestamp is fine. The problem is the numpy.datetime64 representation. Timestamp is actually a subclass of python datetime and as @Wes_Mckinney mentioned, we'll be putting in a fix to box the numpy.datetime64 return value as a Timestamp so it can be represented correctly.Immoralist
The problem with datetime64 is more insidious than it looks: In [5]: pd.Timestamp(df.TS.min()) Out[5]: <Timestamp: 2156-12-27 21:37:20.812867584>Isosceles
Has this been fixed yet? I'm having trouble getting a max timestamp from a series.Heliacal

© 2022 - 2024 — McMap. All rights reserved.