Flag Daylight Saving Time (DST) Hours in Pandas Date-Time Column
Asked Answered
A

7

8

I created an hourly dates dataframe, and now I would like to create a column that flags whether each row (hour) is in Daylight Saving Time or not. For example, in summer hours, the flag should == 1, and in winter hours, the flag should == 0.

# Localized dates dataframe
dates = pd.DataFrame(data=pd.date_range('2018-1-1', '2019-1-1', freq='h', tz='America/Denver'), columns=['date_time'])

# My failed attempt to create the flag column
dates['dst_flag'] = np.where(dates['date_time'].dt.daylight_saving_time == True, 1, 0)
Acquah answered 23/9, 2018 at 15:49 Comment(4)
This might have to be manual. Extract the boundary dates and perform some conditional tests. Related: get the DST boundaries of a given timezone in python.Haphazardly
Seems like an XY problem. Do you really care about daylight savings time, or do you care about converting times between timezones? If it's the latter, use UTC internally and convert to the local timezone when you output.Dewey
@MattMessersmith, I care about identifying DST hours, since I'm using it as a dummy variable in a regression. It affects electricity consumption patterns since consumption depends on daylight.Acquah
Oh okay, interesting! This is not an XY problem then. Thanks for the clarification, just wanted to make sure.Dewey
D
4

There's a nice link in the comments that at least let you do this manually. AFAIK, there isn't a vectorized way to do this.

import pandas as pd
import numpy as np
from pytz import timezone

# Generate data (as opposed to index)                                                                                                                                                                                  
date_range = pd.to_datetime(pd.date_range('1/1/2018', '1/1/2019', freq='h', tz='America/Denver'))
date_range = [date for date in date_range]

# Localized dates dataframe                                                                                                                                                           
df = pd.DataFrame(data=date_range, columns=['date_time'])

# Map transition times to year for some efficiency gain                                                                                                                                                     
tz = timezone('America/Denver')
transition_times = tz._utc_transition_times[1:]
transition_times = [t.astimezone(tz) for t in transition_times]
transition_times_by_year = {}
for start_time, stop_time in zip(transition_times[::2], transition_times[1::2]):
    year = start_time.year
    transition_times_by_year[year] = [start_time, stop_time]

# If the date is in DST, mark true, else false                                                                                                                                                              
def mark_dst(dates):
    for date in dates:
        start_dst, stop_dst = transition_times_by_year[date.year]
        yield start_dst <= date <= stop_dst
df['dst_flag'] = [dst_flag for dst_flag in mark_dst(df['date_time'])]

# Do a quick sanity check to make sure we did this correctly for year 2018                                                                                                                                  
dst_start = df[df['dst_flag'] == True]['date_time'][0] # First dst time 2018
dst_end = df[df['dst_flag'] == True]['date_time'][-1] # Last dst time 2018
print(dst_start)
print(dst_end)

this outputs:

2018-03-11 07:00:00-06:00
2018-11-04 06:00:00-07:00

which is likely correct. I didn't do the UTC conversions by hand or anything to check that the hours are exactly right for the given timezone. You can at least verify the dates are correct with a quick google search.

Some gotchas:

  1. pd.date_range generates an index, not data. I changed your original code slightly to make it be data as opposed to the index. I assume you have the data already.

  2. There's something goofy about how tz._utc_transition_times is structured. It's start/stop utc DST transition times, but there is some goofy stuff in the early dates. It should be good from 1965 onward though. If you are doing dates earlier than that change tz._utc_transition_times[1:] to tz._utc_transition_times. Note not all years before 1965 are present.

  3. tz._utc_transition_times is "Python private". It is liable to change without warning or notice, and may or may not work for future or past versions of pytz. I'm using pytz verion 2017.3. I recommend you run this code to make sure the output matches, and if not, make sure to use version 2017.3.

HTH, good luck with your research/regression problem!

Dewey answered 23/9, 2018 at 19:17 Comment(10)
I really appreciate your help with this. Unfortunately, it's giving me an error when I run this line: transition_times = [t.astimezone(tz) for t in transition_times]. The error is: OSError: [Errno 22] Invalid argument. Maybe it's because I'm using pytz==2018.4Acquah
That is a very strange error. Can you print out the first couple values of tz._utc_transition_times[1:]? Maybe just print tz._utc_transition_times[1:5]. What version of pytz are you using?Dewey
Out[183]: [datetime.datetime(1901, 12, 13, 20, 45, 52), datetime.datetime(1918, 3, 31, 9, 0), datetime.datetime(1918, 10, 27, 8, 0), datetime.datetime(1919, 3, 30, 9, 0)] and I'm using pytz==2018.4Acquah
That looks right. Can you give more of the stack trace? You're certain you have the right line? Have you tried running the example exactly as it is (NOT copy/pasting into your application: but just into a new file and trying it from the new file)?Dewey
Sorry, I'm still pretty new to Python. Here's what I think you're asking for. I copied your code into a completely new file. Traceback (most recent call last): File "<ipython-input-185-ed30519c4311>", line 1, in <module> transition_times = [t.astimezone(tz) for t in transition_times] File "<ipython-input-185-ed30519c4311>", line 1, in <listcomp> transition_times = [t.astimezone(tz) for t in transition_times] OSError: [Errno 22] Invalid argumentAcquah
FYI, I downgraded from pytz==2018.4 to 2017.3 and I got the same error.Acquah
Did you copy it into a new file, or are you running it in an ipython REPL? If you pasted it into an interactive prompt, that's not what I'm suggesting. Try copying it into a new file on your filesystem, call it test.py, then do python test.py from your shell. It's odd that the error has something like ipython-input.... Make sure it's a clean environment: previous variables could inadvertently be blasting over the values posted above.Dewey
Let us continue this discussion in chat.Acquah
you are correct--this solution works perfectly in Linux, but not in Windows. I have no idea why. I tried it on two completely different Windows computers (one at home, and one at work), and neither works. Then I tried it on Ubuntu and it worked perfectly. Very interesting, and perhaps worth its own stackoverflow question!Acquah
This doesn't work in the southern hemisphere, if you set tz = timezone('Australia/Sydney') then transition_times_by_year[2018][0] gives the start of winter not summer. I asked the same question and I think this solution works #53292678Racon
R
3

If you are looking for a vectorized way of doing this (which you probably should be), you can use something like the code below.

The fundamental idea behind this is to find the difference between the current time in your timezone and the UTC time. In the winter months, the difference will be one extra hour behind UTC. Whatever the difference is, add what is needed to get to the 1 or 0 for the flag.

In Denver, summer months are UTC-6 and winter months are UTC-7. So, if you take the difference between the tz-aware time in Denver and UTC time, then add 7, you'll get a value of 1 for summer months and a value of 0 for winter months.

import pandas as pd

start = pd.to_datetime('2020-10-30')
end = pd.to_datetime('2020-11-02')
dates = pd.date_range(start=start, end=end, freq='h', tz='America/Denver')
df1 = pd.DataFrame({'dst_flag': 1, 'date1': dates.tz_localize(None)}, index=dates)

# add extra day on each end so that there are no nan's after the join    
dates = pd.to_datetime(pd.date_range(start=start - pd.to_timedelta(1, 'd'), end=end + pd.to_timedelta(1, 'd'), freq='h'), utc=True)
df2 = pd.DataFrame({'date2': dates.tz_localize(None)}, index=dates)
    
out = df1.join(df2)
out['dst_flag'] = (out['date1'] - out['date2']) / pd.to_timedelta(1, unit='h') + 7
out.drop(columns=['date1', 'date2'], inplace=True)
Ruthenian answered 5/11, 2020 at 6:42 Comment(0)
A
1

Here is what I ended up doing, and it works for my purposes:

import pandas as pd
import pytz

# Create dates table and flag Daylight Saving Time dates
dates = pd.DataFrame(data=pd.date_range('2018-1-1', '2018-12-31-23', freq='h'), columns=['date_time'])

# Create a list of start and end dates for DST in each year, in UTC time
dst_changes_utc = pytz.timezone('America/Denver')._utc_transition_times[1:]

# Convert to local times from UTC times and then remove timezone information
dst_changes = [pd.Timestamp(i).tz_localize('UTC').tz_convert('America/Denver').tz_localize(None) for i in dst_changes_utc]

flag_list = []
for index, row in dates['date_time'].iteritems():
    # Isolate the start and end dates for DST in each year
    dst_dates_in_year = [date for date in dst_changes if date.year == row.year]
    spring = dst_dates_in_year[0]
    fall = dst_dates_in_year[1]
    if (row >= spring) & (row < fall):
        flag = 1
    else:
        flag = 0
    flag_list.append(flag)
print(flag_list)
dates['dst_flag'] = flag_list
del(flag_list)
Acquah answered 23/9, 2018 at 21:49 Comment(0)
F
1

The other answers all seem to rely on other dependencies in addition to Pandas and/or go through many operations to get to the result.

This one-liner should get you all the way in one go without inconvenient conversions etc.:

dates = pd.DataFrame(data=pd.date_range('2018-1-1', '2019-1-1', freq='h', tz='America/Denver'), columns=['date_time'])

dates["dst_flag"]=dates.date_time.map(lambda date: int(date.dst() > pd.Timedelta(0)))
Fagaly answered 11/6, 2024 at 12:24 Comment(0)
E
0

the following vectorized way seem to work fine. The idea behind is the same as Nick Klavoht's idea : find the difference between the current time in your timezone and the utc time.

# Localized dates dataframe
df = pd.DataFrame(data=pd.date_range('2018-1-1', '2019-1-1', freq='h', tz='America/Denver'), columns=['date_time'])

df['utc_offset'] = df['date_time'].dt.strftime('%z').str[0:3].astype(float)
df['utc_offset_shifted'] = df['utc_offset'].shift(-1)
df['dst'] = df['utc_offset'] - df['utc_offset_shifted']
df_dst = df[(df['dst'] != 0) & (df['dst'])]
df_dst = df_dst.drop(['utc_offset', 'utc_offset_shifted'], axis=1).reset_index(drop=True)

print(df_dst)

This outputs :

                  date_time  dst
0 2018-03-11 01:00:00-07:00 -1.0
1 2018-11-04 01:00:00-06:00  1.0
Earthlight answered 11/12, 2020 at 17:52 Comment(0)
Y
0

If you know what time zone you are dealing with you could use:

dates['dst_flag'] = dates['date_time'].apply(lambda x: x.tzname() == 'CEST')

This would flag the all hours in CET as False and in CEST as True. I'm not sure if I'd want to do that on a huge column.

Ylla answered 25/5, 2021 at 9:16 Comment(0)
D
0

Convert to python datetime with to_pydatetime(), then datetime.dst() gives timedelta of dst offset.

#time_delta should normally be 0.0 or 3600.0 seconds.
time_delta = [x.dst().total_seconds() for x in dates['date_time'].dt.to_pydatetime()]
dates['dst_flag'] = np.where(np.array(time_delta) < 0.5, 0, 1)
Daleth answered 7/6, 2023 at 9:34 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.