Python 3 Pandas Select Dataframe using Startswith + or

Asked 25/10, 2016 at 20:29 Answered 18/7, 2019 at 7:36

Solved regex python-3.x pandas dataframe startswith

Looking for the correct syntax to do a str.startswith but I want more than one condition.

The working code I have only returns offices that start with the letter "N":

new_df = df[df['Office'].str.startswith("N", na=False)]

Seeking a code that returns offices that can start with the letters "N","M","V",or "R". The following doesn't seem to work:

    new_df = df[df['Office'].str.startswith("N|M|V|R", na=False)]

What am I missing? Thanks!

Faria answered 25/10, 2016 at 20:29 Comment(3)

IIUC then new_df = df[df['Office'].str.contains("^N|^M|^V|^R", na=False)] should work – Roderick 25/10, 2016 at 20:31

EdChum, add that as an answer it seems to work! Thank you. What do the carrots ^ mean? – Faria 25/10, 2016 at 20:41

MaxU's answer pattern is better, basically ^ means startswith following character so we use contains here because it supports regex patterns – Roderick 25/10, 2016 at 20:42

Try this:

df[df['Office'].str.contains("^(?:N|M|V|R)")]

or:

df[df['Office'].str.contains("^[NMVR]+")]

Demo:

In [91]: df
Out[91]:
        Office
0        No-No
1         AAAA
2    MicroHard
3       Valley
4        vvvvv
5   zzzzzzzzzz
6  Risk is fun

In [92]: df[df['Office'].str.contains("^(?:N|M|V|R)")]
Out[92]:
        Office
0        No-No
2    MicroHard
3       Valley
6  Risk is fun

In [93]: df[df['Office'].str.contains("^[NMVR]+")]
Out[93]:
        Office
0        No-No
2    MicroHard
3       Valley
6  Risk is fun

Dowsabel answered 25/10, 2016 at 20:31 Comment(0)

The method startswith allows a string or tuple as its first argument:

# Option 1
new_df = df[df['Office'].str.startswith(('N','M','V','R'), na=False)

Example:

df = pd.DataFrame(data=[np.nan, 'Austria', 'Norway', 'Madagascar', 'Romania', 'Spain', 'Uruguay', 'Yemen'], columns=['Office'])
print(df)
df.Office.str.startswith(('N','M','V','R'), na=False)

Output:

       Office
0         NaN
1     Austria
2      Norway
3  Madagascar
4     Romania
5       Spain
6     Uruguay
7       Yemen


0    False
1    False
2     True
3     True
4     True
5    False
6    False
7    False

Other options as pointed out by @MaxU are:

# Option 2
df[df['Office'].str.contains("^(?:N|M|V|R)")]

# Option 3
df[df['Office'].str.contains("^[NMVR]+")]

Performace (non exahustive test):

from datetime import datetime

n = 100000

start_time = datetime.now()
for i in range(n):
    df['Office'].str.startswith(('N','M','V','R'), na=False)
print ("Option 1: ", datetime.now() - start_time)

start_time = datetime.now()
for i in range(n):
    df['Office'].str.contains("^(?:N|M|V|R)", na=False)
print ("Option 2: ", datetime.now() - start_time)

start_time = datetime.now()
for i in range(n):
    df['Office'].str.contains("^[NMVR]+", na=False)
print ("Option 3: ", datetime.now() - start_time)

Result:

Option 1:  0:00:22.952533
Option 2:  0:00:23.502708
Option 3:  0:00:23.733182

Final choice: there is not much difference in time, so since sintax is simpler and performace is better, I would choose option 1.

Defy answered 18/7, 2019 at 7:36 Comment(0)

Recommended topics

Hot tags