The method startswith
allows a string or tuple as its first argument:
# Option 1
new_df = df[df['Office'].str.startswith(('N','M','V','R'), na=False)
Example:
df = pd.DataFrame(data=[np.nan, 'Austria', 'Norway', 'Madagascar', 'Romania', 'Spain', 'Uruguay', 'Yemen'], columns=['Office'])
print(df)
df.Office.str.startswith(('N','M','V','R'), na=False)
Output:
Office
0 NaN
1 Austria
2 Norway
3 Madagascar
4 Romania
5 Spain
6 Uruguay
7 Yemen
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 False
Other options as pointed out by @MaxU are:
# Option 2
df[df['Office'].str.contains("^(?:N|M|V|R)")]
# Option 3
df[df['Office'].str.contains("^[NMVR]+")]
Performace (non exahustive test):
from datetime import datetime
n = 100000
start_time = datetime.now()
for i in range(n):
df['Office'].str.startswith(('N','M','V','R'), na=False)
print ("Option 1: ", datetime.now() - start_time)
start_time = datetime.now()
for i in range(n):
df['Office'].str.contains("^(?:N|M|V|R)", na=False)
print ("Option 2: ", datetime.now() - start_time)
start_time = datetime.now()
for i in range(n):
df['Office'].str.contains("^[NMVR]+", na=False)
print ("Option 3: ", datetime.now() - start_time)
Result:
Option 1: 0:00:22.952533
Option 2: 0:00:23.502708
Option 3: 0:00:23.733182
Final choice: there is not much difference in time, so since sintax is simpler and performace is better, I would choose option 1.
new_df = df[df['Office'].str.contains("^N|^M|^V|^R", na=False)]
should work – Roderick^
means startswith following character so we usecontains
here because it supports regex patterns – Roderick