Conditional If Statement: If value in row contains string ... set another column equal to string
A

6

29

EDIT MADE:

I have the 'Activity' column filled with strings and I want to derive the values in the 'Activity_2' column using an if statement.

So Activity_2 shows the desired result. Essentially I want to call out what type of activity is occurring.

I tried to do this using my code below, but it won't run (the error is shown after the code). Any help is greatly appreciated!

    for i in df2['Activity']:
        if i contains 'email':
            df2['Activity_2'] = 'email'
        elif i contains 'conference'
            df2['Activity_2'] = 'conference'
        elif i contains 'call'
            df2['Activity_2'] = 'call'
        else:
            df2['Activity_2'] = 'task'


Error: if i contains 'email':
                ^
SyntaxError: invalid syntax
Aymer answered 11/5, 2017 at 3:15 Comment(5)
did you try if i == 'email': df2['Activity_2'] = 'email' - Grummet
"won't run" is very unhelpful - Mulry
Thanks for the quick response. When I try your code above, there is no 'Activity_2' column in my dataframe - Aymer
@donk: I have posted my error in my message - Aymer
You have a bunch of missing colons on the lines with "elif" statements - Bazil
H
12

The current solution behaves wrongly if your df contains NaN values. In that case I recommend using the following code, which worked for me:

import numpy as np

# Flag missing rows directly; matching a "0" placeholder would also catch any text containing a 0
temp = df.Activity.fillna("")
df['Activity_2'] = np.where(df.Activity.isna(), "None",
                   np.where(temp.str.contains("email"), "email",
                   np.where(temp.str.contains("conference"), "conference",
                   np.where(temp.str.contains("call"), "call", "task"))))
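
As a possible alternative, str.contains accepts an na argument, so missing rows can be treated as "no match" without filling them first. This is only a sketch on made-up sample data (the DataFrame below is an assumption, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
df = pd.DataFrame({'Activity': ['email personA', None, 'call Sam', 'random text']})

# na=False makes str.contains return False (instead of NaN) for missing rows
has_email = df['Activity'].str.contains('email', na=False)
has_call = df['Activity'].str.contains('call', na=False)

# Missing rows get their own label; the rest follow the usual keyword checks
df['Activity_2'] = np.where(df['Activity'].isna(), 'None',
                   np.where(has_email, 'email',
                   np.where(has_call, 'call', 'task')))
```

If you would rather have missing rows fall through to 'task', drop the outer isna() check.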
Hovey answered 13/5, 2019 at 10:51 Comment(1)
Finally, a solution that works and accounts for defaults / NAs - Bazil
T
39

Assuming you are using pandas, you can use numpy.where, which is a vectorized version of if/else, with the condition constructed by str.contains:

import numpy as np

df['Activity_2'] = np.where(df.Activity.str.contains("email"), "email",
                   np.where(df.Activity.str.contains("conference"), "conference",
                   np.where(df.Activity.str.contains("call"), "call", "task")))

df

#   Activity            Activity_2
#0  email personA       email
#1  attend conference   conference
#2  send email          email
#3  call Sam            call
#4  random text         task
#5  random text         task
#6  lwantto call        call
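
One thing worth knowing about the nested form: the checks are applied in order, so the first matching keyword wins when a row contains more than one. A small sketch on made-up data (the sample strings are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the first one contains both 'email' and 'call'
df = pd.DataFrame({'Activity': ['send email about call', 'call Sam', 'random text']})

activity = df['Activity']
# The outermost condition has the highest priority
df['Activity_2'] = np.where(activity.str.contains('email'), 'email',
                   np.where(activity.str.contains('conference'), 'conference',
                   np.where(activity.str.contains('call'), 'call', 'task')))

labels = df['Activity_2'].tolist()
```

Reorder the conditions if a different keyword should take precedence.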
Torchwood answered 11/5, 2017 at 3:31 Comment(2)
@Psidom can you help me with one of my questions, #52820333 - Katherine
One does not need to call np from pandas. If you do, you get the following message: "The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly." Just using np.where() should do the job. It is a good solution suggested by @Psidom. Thank you Psidom! - Blackstock
H
14

This also works:

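df['Activity_2'] = 'task'  # default for rows that match none of the keywords
# Later assignments overwrite earlier ones when a row matches several keywords
df.loc[df['Activity'].str.contains('email'), 'Activity_2'] = 'email'
df.loc[df['Activity'].str.contains('conference'), 'Activity_2'] = 'conference'
df.loc[df['Activity'].str.contains('call'), 'Activity_2'] = 'call'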
Hortensiahorter answered 8/12, 2017 at 10:41 Comment(1)
I realize this is a couple of years old, but I have thousands of lines like this. How would you implement them efficiently? - Rodolphe
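
For a long keyword list, one possible approach is to build a single regular expression from the list and pull out the match with str.extract, rather than writing one line per keyword. This is a sketch, not a drop-in answer: the keywords list and sample data are assumptions, and the regex returns whichever keyword appears first in the text, not whichever comes first in the list.

```python
import re
import pandas as pd

# Hypothetical keyword list; in practice this could hold thousands of entries
keywords = ['email', 'conference', 'call']

# One capture group with every keyword as an alternative, escaped for safety
pattern = '(' + '|'.join(re.escape(k) for k in keywords) + ')'

df = pd.DataFrame({'Activity': ['email personA', 'attend conference',
                                'lwantto call', 'random text']})

# Rows without a match come back as NaN, so fill them with the default label
df['Activity_2'] = df['Activity'].str.extract(pattern, expand=False).fillna('task')
```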
D
3

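You have invalid syntax for checking strings. Python has no contains keyword; use the in operator instead.

Try using:

for i in df2['Activity']:
    if 'email' in i:
        # Note: this assigns 'email' to every row of the column, not just the current one
        df2['Activity_2'] = 'email'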
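
To keep the loop but label each row separately, one option is to collect the labels in a list and assign the list to the column once at the end. A minimal sketch, assuming every value in 'Activity' is a string (the sample DataFrame is made up):

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df2
df2 = pd.DataFrame({'Activity': ['email personA', 'attend conference',
                                 'call Sam', 'random text']})

labels = []
for text in df2['Activity']:
    # 'in' does the substring check that 'contains' was meant to do
    if 'email' in text:
        labels.append('email')
    elif 'conference' in text:
        labels.append('conference')
    elif 'call' in text:
        labels.append('call')
    else:
        labels.append('task')

# One label per row, assigned in a single step
df2['Activity_2'] = labels
```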
Douglassdougy answered 11/5, 2017 at 3:32 Comment(0)
B
2
  1. Your code had bugs: no colons on the "elif" lines.
  2. You didn't mention you were using Pandas, but that's the assumption I'm going with.
  3. My answer handles defaults, follows Python conventions, and is easy to adapt for additional activities.

import pandas as pd

DEFAULT_ACTIVITY = 'task'


def assign_activity(todo_item):
    """Assign an activity label to a raw-text TODO item."""
    activities = ['email', 'conference', 'call']

    for activity in activities:
        if activity in todo_item:
            return activity

    # No keyword matched after checking all of them: use the default value
    return DEFAULT_ACTIVITY


df = pd.DataFrame({'Activity': ['email person A', 'attend conference', 'call Charly'],
                   'Colleague': ['Knor', 'Koen', 'Hedge']})

# You should really come up with a better name than 'Activity_2', like 'Labels' or something.
df["Activity_2"] = df["Activity"].apply(assign_activity)
Bazil answered 11/11, 2021 at 22:50 Comment(0)
M
1

Another solution can be found in a post made by @unutbu. It also works well for creating conditional columns. I changed the condition from that post, df['Set'] == Z, to df['Activity'].str.contains('yourtext') to match your question. See the example below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Activity': ['email person A', 'attend conference', 'call foo']})

conditions = [
    df['Activity'].str.contains('email'),
    df['Activity'].str.contains('conference'),
    df['Activity'].str.contains('call')]

values = ['email', 'conference', 'call']

df['Activity_2'] = np.select(conditions, values, default='task')

print(df)
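
If the keyword list grows, the conditions can be generated from the list instead of being written out by hand. The sketch below assumes a hypothetical keywords list and sample data; it also passes na=False so that missing values yield plain booleans, which np.select expects.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including one missing value
df = pd.DataFrame({'Activity': ['email person A', None, 'call foo', 'random text']})

keywords = ['email', 'conference', 'call']

# One boolean mask per keyword; regex=False treats each keyword as literal text
conditions = [df['Activity'].str.contains(k, regex=False, na=False) for k in keywords]

# The first condition that is True decides the label; otherwise the default applies
df['Activity_2'] = np.select(conditions, keywords, default='task')
```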

You can find the original post here: Pandas conditional creation of a series/dataframe column

Malines answered 4/6, 2021 at 14:14 Comment(3)
Tried this, but all values were just the default - Bazil
@DaveLiu the example works perfectly in my Jupyter notebook instance. Can you explain your issue further? Did you copy this one-to-one, or what did you try? - Malines
I don't recall the issue, maybe a pandas/numpy versioning discrepancy - Bazil
