Select DataFrame rows between two dates
T

14

401

I am creating a DataFrame from a csv as follows:

stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)

The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?

Tonicity answered 31/3, 2015 at 13:38 Comment(0)
M
710

There are two possible solutions:

  • Use a boolean mask, then use df.loc[mask]
  • Set the date column as a DatetimeIndex, then use df.loc[start_date : end_date]

Using a boolean mask:

Ensure df['date'] is a Series with dtype datetime64[ns]:

df['date'] = pd.to_datetime(df['date'])  

Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:

# later than the start date and no later than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)

Select the sub-DataFrame:

df.loc[mask]

or re-assign to df

df = df.loc[mask]

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])

yields

            0         1         2       date
153  0.208875  0.727656  0.037787 2000-06-02
154  0.750800  0.776498  0.237716 2000-06-03
155  0.812008  0.127338  0.397240 2000-06-04
156  0.639937  0.207359  0.533527 2000-06-05
157  0.416998  0.845658  0.872826 2000-06-06
158  0.440069  0.338690  0.847545 2000-06-07
159  0.202354  0.624833  0.740254 2000-06-08
160  0.465746  0.080888  0.155452 2000-06-09
161  0.858232  0.190321  0.432574 2000-06-10

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])

yields

                   0         1         2
date                                    
2000-06-01  0.040457  0.326594  0.492136    # <- includes start_date
2000-06-02  0.279323  0.877446  0.464523
2000-06-03  0.328068  0.837669  0.608559
2000-06-04  0.107959  0.678297  0.517435
2000-06-05  0.131555  0.418380  0.025725
2000-06-06  0.999961  0.619517  0.206108
2000-06-07  0.129270  0.024533  0.154769
2000-06-08  0.441010  0.741781  0.470402
2000-06-09  0.682101  0.375660  0.009916
2000-06-10  0.754488  0.352293  0.339337

Python list slicing, e.g. seq[start:end], includes start but not end. In contrast, pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
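A small sketch of that slicing behaviour on a made-up frame (the column name `value` and the dates are illustrative assumptions only). It shows both end-points being included, and bounds that are missing from the index being tolerated as long as the index is sorted:

```python
import pandas as pd

# Hypothetical data: every second day, so 2000-01-02 and 2000-01-08 are absent.
idx = pd.date_range('2000-01-01', periods=6, freq='2D')
df = pd.DataFrame({'value': range(6)}, index=idx)

# Label slicing on a DatetimeIndex includes both end-points when present.
both_present = df.loc['2000-01-03':'2000-01-07']

# Neither bound is in the index here; the slice still works because the
# index is sorted (monotonic increasing).
both_missing = df.loc['2000-01-02':'2000-01-08']

# If the rows may arrive out of order, sort before slicing.
unsorted = df.sample(frac=1, random_state=0)
resorted = unsorted.sort_index().loc['2000-01-02':'2000-01-08']
```

Sorting first is the cautious choice; slicing an unsorted DatetimeIndex with bounds that are not in it is where errors or surprising results tend to appear.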


Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
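As a rough illustration of the parse_dates route, the sketch below reads a tiny in-memory CSV instead of a real file; the column names `date` and `close` and the values are invented for the example:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the column names are made up.
csv_text = io.StringIO(
    "date,close\n"
    "2015-03-27,10.0\n"
    "2015-03-30,10.5\n"
    "2015-03-31,11.0\n"
    "2015-04-01,10.8\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
stock = pd.read_csv(csv_text, skipinitialspace=True, parse_dates=['date'])

# The column can be compared against date strings straight away.
mask = (stock['date'] >= '2015-03-30') & (stock['date'] <= '2015-03-31')
selected = stock.loc[mask]
```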

Macassar answered 31/3, 2015 at 13:49 Comment(10)
Setting the date column as the index works well, but it's not clear from the documentation I've seen that one can do that. Thanks. - Editorial
@FaheemMitha: I added a link above to where "partial string indexing" is documented. - Macassar
The part that is perhaps less clear is that an index has to be explicitly created. And without explicitly creating the index, a restricted range returns an empty set, not an error. - Editorial
After the df = df.set_index(['date']) step, I have found the index also needs to be sorted (via df.sort_index(inplace=True, ascending=True)), as otherwise you can get less than full or even empty DataFrame results from df.loc['2000-6-1':'2000-6-10']. And if you use ascending=False, that won't work at all, even if you reverse it with df.loc['2000-6-10':'2000-6-1']. - Kipper
If you want to keep the 'date' column while still giving its values to the index of the dataframe, you can do this: df.index = df['date'] - Wensleydale
For the first method, the date string needs to be converted to datetime.date for the relational operation to work. Otherwise it throws a TypeError. For example, pd.to_datetime("2000-6-1").date() made it work for me. - Uprising
Please delete or update this answer; it's years out of date since pandas 0.22 implemented between() on Series. - Disused
Using the mask method, I get the TypeError: 'Series' objects are mutable, thus they cannot be hashed. Any suggestion? - Finder
This helped me a lot!!! - Life
If the date time is not in the index but in a column (called DateTime): df.loc[df.DateTime.between('2022-8-11','2022-8-17')] also works. - Synecdoche
A
135

I feel the best option is to use direct comparisons rather than the loc function:

df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]

It works for me.

A major issue with slicing through loc is that, when the index is not sorted, limits that are not present among the actual index values result in a KeyError.

Altissimo answered 13/7, 2016 at 1:0 Comment(4)
I think slices via loc are great. And it seems to me that, as unutbu says, neither start_date nor end_date has to be in the index. - Crappie
How do I filter the dates from 14 days before until the current date? If today's date is 2019-01-15, I need the data from 2019-01-01 till 2019-01-15. - Gnash
Simple and elegant. Thanks Christin, this is what I was trying to do. Works for me. - Ufo
What is the type of df['date']? Is it a datetime or a string, or do both work? - Hilda
O
106

You can also use between:

df[df.some_date.between(start_date, end_date)]
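One thing worth checking with between is how the end bound treats timestamps that carry a time of day. The sketch below uses invented data (the `value` column and the timestamps are assumptions) to show that a bare date string is read as midnight, along with one possible way to compare whole calendar days instead:

```python
import pandas as pd

# Hypothetical timestamps: two on the end date, one of them after midnight.
df = pd.DataFrame({
    'some_date': pd.to_datetime([
        '2022-08-10 23:00',
        '2022-08-11 08:30',
        '2022-08-17 00:00',
        '2022-08-17 15:45',
    ]),
    'value': [1, 2, 3, 4],
})

# A bare date string means midnight, so 15:45 on the 17th is excluded.
until_midnight = df[df['some_date'].between('2022-08-11', '2022-08-17')]

# Comparing calendar days instead keeps every row stamped on the end date.
whole_days = df[df['some_date'].dt.normalize().between('2022-08-11', '2022-08-17')]
```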
Odele answered 5/3, 2018 at 21:28 Comment(2)
Also check between_time: pandas.pydata.org/pandas-docs/version/0.20.3/generated/… - Claire
@AntonTarasenko Strangely, does not work with datetimes, but rather only times. Took me a while to realize this distinction. That's how I ended up consulting this thread. - Felske
F
33

You can use the isin method on the date column, like so: df[df["date"].isin(pd.date_range(start_date, end_date))]

Note: This only works with dates (as the question asks) and not with timestamps that carry a time of day, because pd.date_range generates midnight values only.

Example:

import numpy as np   
import pandas as pd

# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]

print(in_range_df)  # print result

which gives

           0         1         2       date
14  0.960974  0.144271  0.839593 2017-01-15
15  0.814376  0.723757  0.047840 2017-01-16
16  0.911854  0.123130  0.120995 2017-01-17
17  0.505804  0.416935  0.928514 2017-01-18
18  0.204869  0.708258  0.170792 2017-01-19
19  0.014389  0.214510  0.045201 2017-01-20
Folberth answered 23/1, 2017 at 8:47 Comment(2)
I seem to get this error when I'm trying to do this: ValueError: Of the four parameters: start, end, periods, and freq, exactly three must be specified - Casady
What code are you using? pd.date_range accepts four parameters but you only need at most 3 of these parameters to be set. Since two of the four parameters are optional it's sometimes possible to only specify 2 parameters. However, without seeing your code, I can't tell you how you would fix it. - Folberth
G
23

pandas 0.22 has a between() method on Series, which makes the answer to this question simpler and the code more readable.

# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})

Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:

# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)

0    False
1    False
2    False
3    False
4    False

# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]

    dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02

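Notice the inclusive argument; it is very helpful when you want to be explicit about your range. (From pandas 1.3 on, inclusive takes the strings 'both', 'neither', 'left' or 'right' in place of True and False, and pandas 2.0 no longer accepts the booleans.) When it is set to True, Nov 27th of 2018 is returned as well: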

df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]

    dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
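For readers on a newer release, here is a short sketch of the string forms of inclusive, assuming pandas 1.3 or later, where the argument accepts 'both', 'neither', 'left' or 'right'. The four-day window is chosen only to keep the results small:

```python
import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})

# Exclude both bounds (the old inclusive=False).
open_interval = df.loc[
    df['dates'].between('2018-11-27', '2018-11-30', inclusive='neither')
]

# Include both bounds (the old inclusive=True, and still the default).
closed_interval = df.loc[
    df['dates'].between('2018-11-27', '2018-11-30', inclusive='both')
]

# Include only the start, like a half-open Python range.
left_closed = df.loc[
    df['dates'].between('2018-11-27', '2018-11-30', inclusive='left')
]
```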

This method is also faster than the previously mentioned isin method:

%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


%%timeit -n 5

df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

However, it is not faster than the currently accepted answer by unutbu when the mask has already been created. If the mask is dynamic and needs to be rebuilt over and over, my method may be more efficient:

# create the mask first, THEN time the selection
import datetime as dt

start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)

%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
Genniegennifer answered 15/1, 2019 at 21:50 Comment(1)
This was closer to the solution I was looking for. I need to have the dates provided included, since they were user input variables, not just between the dates (i.e. I needed inclusive, not exclusive of dates given). But it is important to note that this df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)] is returning inclusive date of the 2018-11-27, but not inclusive of 2019-01-15, as far as I can tell. - Sholeen
S
17

To keep the solution simple and pythonic, I would suggest you try this.

If you are going to do this frequently, the best solution is to first set the date column as the index, which gives you a DatetimeIndex (provided the column holds datetimes), and then use the following condition to select any range of dates.

import pandas as pd

data_frame = data_frame.set_index('date')

df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
Saguaro answered 21/9, 2017 at 12:5 Comment(0)
R
10

Another way to achieve this is the pandas.DataFrame.query() method. Here is an example on the following data frame called df.

>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
      col_1       date
0  0.015198 2020-01-01
1  0.638600 2020-01-02
2  0.348485 2020-01-03
3  0.247583 2020-01-04
4  0.581835 2020-01-05

As an argument, use the condition for filtering like this:

>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= @start_date and date <= @end_date'))
      col_1       date
1  0.638600 2020-01-02
2  0.348485 2020-01-03
3  0.247583 2020-01-04

If you do not want to include the boundaries, just change the condition as follows:

>>> print(df.query('date > @start_date and date < @end_date'))
      col_1       date
2  0.348485 2020-01-03
Retractor answered 23/3, 2020 at 14:49 Comment(0)
L
9

It is highly recommended to convert a date column to an index. Doing so makes many operations easier, one of which is selecting the rows between two dates, as this example shows:

import numpy as np   
import pandas as pd

# Dataframe with monthly data between 2016 - 2020
df = pd.DataFrame(np.random.random((60, 3)))
df['date'] = pd.date_range('2016-1-1', periods=60, freq='M')

To select the rows from the start of 2017 through the end of 2019, you only need to convert the date column to an index:

df.set_index('date', inplace=True)

and then slice:

df.loc['2017':'2019']

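Instead of calling df.set_index(), you can make the date column the index directly while reading the csv file. Pass parse_dates as well, so that the index holds datetimes rather than strings:

df = pd.read_csv('file_name.csv', index_col='date', parse_dates=['date'])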
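To make the whole round trip concrete, the following sketch reads a small in-memory CSV (a stand-in for 'file_name.csv', with invented rows and an assumed `value` column), builds the datetime index while reading, and then slices by year:

```python
import io
import pandas as pd

# Stand-in for 'file_name.csv'; the rows are deliberately out of order.
csv_text = io.StringIO(
    "date,value\n"
    "2019-06-30,3\n"
    "2016-12-31,0\n"
    "2017-01-31,1\n"
    "2020-01-31,4\n"
    "2018-07-31,2\n"
)

# parse_dates makes the index a DatetimeIndex rather than plain strings;
# sort_index makes label slicing reliable.
df = pd.read_csv(csv_text, index_col='date', parse_dates=['date']).sort_index()

# Partial-string slicing: everything from 2017-01-01 through 2019-12-31.
subset = df.loc['2017':'2019']
```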
Leastways answered 13/3, 2022 at 10:45 Comment(0)
M
6
import pandas as pd

technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000],
    'Duration':['30days','50days','55days','40days','60days','35days','55days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400],
    'InsertedDates':["2021-11-14","2021-11-15","2021-11-16","2021-11-17","2021-11-18","2021-11-19","2021-11-20"]
               })
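df = pd.DataFrame(technologies)
# Convert the date strings to datetime64 so the filters below compare dates, not text
df['InsertedDates'] = pd.to_datetime(df['InsertedDates'])
print(df)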

Using pandas.DataFrame.loc to Filter Rows by Dates

Method 1:

    start_date = '2021-11-15'
    end_date = '2021-11-19'
    mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date)

    df2 = df.loc[mask]
    print(df2)

Method 2:

    start_date = '2021-11-15'
    end_date = '2021-11-19'
    after_start_date = df["InsertedDates"] >= start_date
    before_end_date = df["InsertedDates"] <= end_date
    between_two_dates = after_start_date & before_end_date


    df2 = df.loc[between_two_dates]
    print(df2)

Using pandas.DataFrame.query() to select DataFrame Rows

start_date = '2021-11-15'
end_date   = '2021-11-18'
df2 = df.query('InsertedDates >= @start_date and InsertedDates <= @end_date')
print(df2)

Select rows between two dates using DataFrame.query()

start_date = '2021-11-15'
end_date = '2021-11-18'
df2 = df.query('InsertedDates > @start_date and InsertedDates < @end_date')
print(df2)

pandas.Series.between() function Using two dates

df2 = df.loc[df["InsertedDates"].between("2021-11-16", "2021-11-18")]
print(df2)

Select DataFrame rows between two dates using Series.isin()

df2 = df[pd.to_datetime(df["InsertedDates"]).isin(pd.date_range("2021-11-15", "2021-11-17"))]
print(df2)
Mclendon answered 29/8, 2022 at 0:58 Comment(0)
R
4

You can use the method truncate:

dates = pd.date_range('2016-01-01', '2016-01-06', freq='d')
df = pd.DataFrame(index=dates, data={'A': 1})

            A
2016-01-01  1
2016-01-02  1
2016-01-03  1
2016-01-04  1
2016-01-05  1
2016-01-06  1

Select data between two dates:

df.truncate(before=pd.Timestamp('2016-01-02'),
            after=pd.Timestamp('2016-01-4'))

Output:

            A
2016-01-02  1
2016-01-03  1
2016-01-04  1
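Two details of truncate that may be useful in practice: either bound can be left out, and the method expects a sorted index. A minimal sketch on a similar frame (the values in column A are changed to a range here purely so the rows can be told apart):

```python
import pandas as pd

dates = pd.date_range('2016-01-01', '2016-01-06', freq='D')
df = pd.DataFrame({'A': range(6)}, index=dates)

# Either bound may be omitted to leave that side open.
from_third = df.truncate(before=pd.Timestamp('2016-01-03'))
up_to_second = df.truncate(after=pd.Timestamp('2016-01-02'))

# truncate expects a sorted index, so sort shuffled data first.
shuffled = df.sample(frac=1, random_state=1)
middle = shuffled.sort_index().truncate(
    before=pd.Timestamp('2016-01-02'),
    after=pd.Timestamp('2016-01-04'),
)
```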
Renn answered 6/10, 2021 at 7:39 Comment(1)
This is the best answer when the date is already a datetime index rather than a column. - Justice
S
3

I prefer not to alter the df.

An option is to retrieve the index of the start and end dates:

import numpy as np   
import pandas as pd

#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]

#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]

which results in:

     0   1   2       date
6  0.5 0.8 0.8 2017-01-07
7  0.0 0.7 0.3 2017-01-08
8  0.8 0.9 0.0 2017-01-09
9  0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
Scholarship answered 7/2, 2018 at 15:33 Comment(0)
F
2

Inspired by unutbu

print(df.dtypes)                                 #Check the dtypes; the date column should be datetime64[ns] (convert with pd.to_datetime if it is 'object'). Once set, the index is not listed here.
columnName = 'YourColumnName'
df[columnName+'index'] = df[columnName]          #Create a new column for index
df.set_index(columnName+'index', inplace=True)   #To build index on the timestamp/dates
df.loc['2020-09-03 01:00':'2020-09-06']          #Select range from the index. This is your new Dataframe.
Footage answered 8/9, 2020 at 11:28 Comment(0)
S
1

You can do it with pd.date_range() and Timestamp. Let's say you have read a csv file with a date column using the parse_dates option:

df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])

Then you can define a date range index :

rge = pd.date_range(end='15/6/2020', periods=2)

and then filter your values by date thanks to a map:

df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]
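A vectorized variant of the same idea, offered as an alternative sketch rather than the method above: normalize the timestamps to midnight and test membership with isin. The frame below is an invented stand-in for the parsed CSV, and the ISO-style end date is used to avoid any day/month ambiguity:

```python
import pandas as pd

# Hypothetical stand-in for the parsed CSV column.
df = pd.DataFrame({
    'my_date_col': pd.to_datetime([
        '2020-06-13 09:00',
        '2020-06-14 18:30',
        '2020-06-15 07:15',
        '2020-06-16 12:00',
    ]),
    'value': [1, 2, 3, 4],
})

# Two consecutive days ending on 15 June 2020.
rge = pd.date_range(end='2020-06-15', periods=2)

# normalize() resets each timestamp to midnight so it can match the daily range.
selected = df.loc[df['my_date_col'].dt.normalize().isin(rge)]
```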
Susann answered 7/8, 2020 at 13:10 Comment(0)
J
1

I just want to point out that as of pandas 2.1.0, query() can be used even more tersely. Basically, instead of chaining two conditions using &, you can reduce them to a single expression:

df = pd.DataFrame({'date': pd.date_range('2000-1-1', None, 200), 'col': range(200)})
# rows between 2000-6-1 and 2000-6-10
df1 = df.query("'2000-6-1' <= date <= '2000-6-10'")


# using local variable
start, end = '2000-6-1', '2000-6-10'                 # using string
df1 = df.query("@start <= date <= @end")


from datetime import datetime
start, end = datetime(2000,6,1), datetime(2000,6,10) # using datetime/pd.Timestamp
df1 = df.query("@start <= date <= @end")

If the index is datetime, then we can use

df = pd.DataFrame({'col': range(200)}, index=pd.date_range('2000-1-1', None, 200))
df1 = df.query("'2000-6-1' <= index <= '2000-6-10'")

Then again, in that case, loc slicing is probably more readable:

df1 = df.loc['2000-6-1':'2000-6-10']
df1 = df.loc[datetime(2000,6,1): datetime(2000,6,10)]

One important note is that these methods (in fact, all methods on this page) work only if the datetime column or the datetime index is datetime64[ns] dtype. If it's something else such as datetime.date or object etc, you must first convert to datetime64[ns] using pd.to_datetime():

df['date'] = pd.to_datetime(df['date'])

# if it's the index:
df.index = pd.to_datetime(df.index)

If the data is read from a CSV file as in the OP, you can pass the parse_dates kwarg. For example, if the datetimes are in the date column of the CSV, you can use:

df = pd.read_csv("my_data.csv", parse_dates=['date'])

If you want to read the datetimes as index from a CSV, then use

df = pd.read_csv("my_data.csv", parse_dates=['date'], index_col=['date'])
Johst answered 22/12, 2023 at 6:19 Comment(0)
