Using Apply in Pandas Lambda functions with multiple if statements

I'm trying to infer a classification according to the size of a person in a dataframe like this one:

      Size
1     80000
2     8000000
3     8000000000
...

I want it to look like this:

      Size        Classification
1     80000       <1m
2     8000000     1-10m
3     8000000000  >1bi
...

I understand that the ideal process would be to apply a lambda function like this:

df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else "1-10m" if 1000000<x<10000000 else ...)

I checked a few posts about multiple ifs in a lambda function (here is an example link), but that syntax is not working for me with multiple if statements, although it worked with a single if condition.

So I tried this "very elegant" solution:

df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "1-10m" if 1000000 < x < 10000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "10-50m" if 10000000 < x < 50000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "50-100m" if 50000000 < x < 100000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "100-500m" if 100000000 < x < 500000000 else pass)
df['Classification']=df['Size'].apply(lambda x: "500m-1bi" if 500000000 < x < 1000000000 else pass)
df['Classification']=df['Size'].apply(lambda x: ">1bi" if 1000000000 < x else pass)

It turns out that "pass" cannot be used in a lambda function either:

df['Classification']=df['Size'].apply(lambda x: "<1m" if x<1000000 else pass)
SyntaxError: invalid syntax

Any suggestions on the correct syntax for multiple if statements inside a lambda function passed to apply in Pandas? Either multi-line or single-line solutions work for me.

Argumentation answered 19/2, 2018 at 18:34 Comment(4)
You can just use a function. - Latent
What would that look like, @AntonvBR? - Argumentation
@abutremutante Write a function to do the work and pass its name as an argument to apply. - Stidham
Have you looked at pd.cut or categories? - Tendinous

Here is a small example that you can build upon:

Basically, lambda x: ... is just a one-line shorthand for a function. All apply really needs is a function, which you can easily write yourself.

import pandas as pd

# Recreate the dataframe
data = dict(Size=[80000,8000000,800000000])
df = pd.DataFrame(data)

# Create a function that returns the desired values.
# Only the upper bound needs checking, because each branch is reached
# only when all the previous conditions were false.
def func(x):
    if x < 1e6:
        return "<1m"
    elif x < 1e7:
        return "1-10m"
    elif x < 5e7:
        return "10-50m"
    else:
        # Fallback; add more elif branches above for the remaining ranges
        return 'N/A'

df['Classification'] = df['Size'].apply(func)

print(df)

Returns:

        Size Classification
0      80000            <1m
1    8000000          1-10m
2  800000000            N/A
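
For completeness, here is a sketch of the same idea extended to every range listed in the question. The function name classify_size is made up for illustration, and the choice to put a value that sits exactly on a boundary (e.g. 1,000,000) into the higher bucket is an assumption, since the question leaves the boundaries unspecified.

```python
import pandas as pd


def classify_size(x):
    """Hypothetical helper: map a size to the labels used in the question."""
    # Each branch is reached only if all previous upper bounds were exceeded,
    # so checking the upper bound alone is enough.
    if x < 1e6:
        return "<1m"
    elif x < 1e7:
        return "1-10m"
    elif x < 5e7:
        return "10-50m"
    elif x < 1e8:
        return "50-100m"
    elif x < 5e8:
        return "100-500m"
    elif x < 1e9:
        return "500m-1bi"
    else:
        return ">1bi"


# Sample data taken from the question
df = pd.DataFrame({"Size": [80000, 8000000, 8000000000]})
df["Classification"] = df["Size"].apply(classify_size)
print(df)
```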
Latent answered 19/2, 2018 at 18:37 Comment(2)
I tried the approaches listed and found that writing your own function is a much more flexible and transparent way of doing this, and it could avoid some unintended consequences. - Spoilage
Thank you, and yes, that might be the case. However, for pure performance something like maxU's example should be used! - Latent

The lambda passed to apply actually does the job here. I wonder what the problem was, because your syntax looks OK and it works:

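import pandas as pd

# Sample sizes in a single-column dataframe
df1 = [80000, 8000000, 8000000000, 800000000000]
df = pd.DataFrame(df1)
df.columns = ['size']
# Chained conditional expressions; the final else catches everything from 10m upwards
df['Classification'] = df['size'].apply(lambda x: '<1m' if x < 1000000 else '1-10m' if 1000000 <= x < 10000000 else '1bi')
df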

Output:

           size Classification
0         80000            <1m
1       8000000          1-10m
2    8000000000            1bi
3  800000000000            1bi
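
As for why the attempt with pass fails: pass is a statement, while the body of a lambda must be a single expression, so every else in a conditional expression has to supply a value. Below is a sketch of the chained form covering all the ranges from the question. The name classify and the handling of exact boundary values (they go into the higher bucket) are assumptions made for illustration.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"Size": [80000, 8000000, 8000000000]})

# Each `else` must be followed by a value (here, another conditional
# expression); `pass` is a statement and is not allowed inside a lambda.
classify = lambda x: (
    "<1m" if x < 1_000_000
    else "1-10m" if x < 10_000_000
    else "10-50m" if x < 50_000_000
    else "50-100m" if x < 100_000_000
    else "100-500m" if x < 500_000_000
    else "500m-1bi" if x < 1_000_000_000
    else ">1bi"
)

df["Classification"] = df["Size"].apply(classify)
print(df)
```

With this many branches a named function is arguably easier to read, but the one-liner form is valid Python.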

Delft answered 10/6, 2020 at 14:34 Comment(0)

You can use the pd.cut function:

bins = [0, 1000000, 10000000, 50000000, ...]
labels = ['<1m','1-10m','10-50m', ...]

df['Classification'] = pd.cut(df['Size'], bins=bins, labels=labels)
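
A runnable sketch of this approach with the elided bins filled in from the ranges in the question. The complete bin list, the use of np.inf as the open upper edge, and right=False are all assumptions: by default pd.cut uses right-closed intervals, so a value of exactly 1,000,000 would land in '<1m', and right=False is used here so that it lands in '1-10m' instead.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"Size": [80000, 8000000, 8000000000]})

# Assumed bin edges; there must be exactly one more edge than labels
bins = [0, 1e6, 1e7, 5e7, 1e8, 5e8, 1e9, np.inf]
labels = ["<1m", "1-10m", "10-50m", "50-100m", "100-500m", "500m-1bi", ">1bi"]

# right=False makes each interval closed on the left: [0, 1e6), [1e6, 1e7), ...
df["Classification"] = pd.cut(df["Size"], bins=bins, labels=labels, right=False)
print(df)
```

Note that the new column has a categorical dtype; call .astype(str) on it if plain strings are needed.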
Odeliaodelinda answered 19/2, 2018 at 18:40 Comment(0)

Using Numpy's searchsorted

import numpy as np

labels = np.array(['<1m', '1-10m', '10-50m', '>50m'])
bins = np.array([1E6, 1E7, 5E7])

# Using assign is my preference as it produces a copy of df with new column
df.assign(Classification=labels[bins.searchsorted(df['Size'].values)])

If you wanted to produce new column in existing dataframe

df['Classification'] = labels[bins.searchsorted(df['Size'].values)]

Some Explanation

From the docs for np.searchsorted:

Find indices where elements should be inserted to maintain order.

Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.

The labels array is one element longer than bins. That is because when a value is greater than the maximum value in bins, searchsorted returns len(bins), which is one past the last index of bins. When we index labels with that position, it grabs the last label.
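
To make the indexing concrete, here is a sketch with the bins extended to all ranges from the question. The extended bins and labels and the use of side='right' are assumptions: searchsorted returns an insertion position between 0 and len(bins), and with the default side='left' a value of exactly 1e6 would get position 0 ('<1m'), whereas side='right' gives it position 1 ('1-10m').

```python
import numpy as np
import pandas as pd

# Assumed labels and bin edges covering every range in the question;
# labels has one more element than bins.
labels = np.array(["<1m", "1-10m", "10-50m", "50-100m",
                   "100-500m", "500m-1bi", ">1bi"])
bins = np.array([1e6, 1e7, 5e7, 1e8, 5e8, 1e9])

# Sample data taken from the question
df = pd.DataFrame({"Size": [80000, 8000000, 8000000000]})

# Insertion positions: 0 for values below the first edge,
# len(bins) for values at or above the last edge.
idx = bins.searchsorted(df["Size"].to_numpy(), side="right")
df["Classification"] = labels[idx]
print(df)
```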

Selfemployed answered 19/2, 2018 at 18:48 Comment(2)
Great, of course +1, but is it really necessary to use df.assign here? In my opinion it is less readable. - Latent
@AntonvBR I love assign for many reasons. First and foremost, because when OP tries my code, they don't automatically clobber their dataframe. Second, I prefer the design pattern of producing new dataframes and assigning them back to the name. That said, I'll show both alternatives (-: - Selfemployed
