Get min and max values of a categorical variable in a dataframe
I have a dataframe that looks like this:

D   X   Y   Z
A  22  16  23
A  21  16  22
A  20  17  21
B  33  50  11
B  34  53  12
B  34  55  13
C  44  34  11
C  45  33  11
C  45  33  10
D  55  35  60
D  57  34  61
E  66  36  13
E  67  38  14 
E  67  37  13

I want to get the minimum and maximum of each column for every level of the categorical variable D, so the output dataframe should look something like this:

D Xmin Xmax Ymin Ymax Zmin Zmax
A 20    22   16   17   21   23 
B 33    34   50   55   11   13    
C 44    45   33   34   10   11  
D 55    57   34   35   60   61
E 66    67   36   38   13   14

I have tried this, but no luck:

min_max_df = dfObj.groupby('D').agg({'X': [dfObj.min(axis=0), dfObj.max(axis=0)]})
Mob answered 13/11, 2021 at 0:22 Comment(0)

I believe this is a nice way of doing it in a single line of code: join combines the two grouped results by index, and lsuffix/rsuffix differentiate the min and max columns.

output = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

Outputs:

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
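
Note that D ends up as the index here. If you also want it back as a regular column, matching the desired output, you can reset the index:

output = output.reset_index()  # move D from the index back to a column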
Belda answered 13/11, 2021 at 0:39 Comment(0)

df = df.groupby('D').agg(['min', 'max'])

Output:

>>> df
    X       Y       Z    
  min max min max min max
D                        
A  20  22  16  17  21  23
B  33  34  50  55  11  13
C  44  45  33  34  10  11
D  55  57  34  35  60  61
E  66  67  36  38  13  14

>>> df['X']['min']
D
A    20
B    33
C    44
D    55
E    66
Name: min, dtype: int64
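
Equivalently, you can select a MultiIndex column with a single tuple key instead of chained indexing:

>>> df[('X', 'min')]  # same Series as df['X']['min'], in one lookup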

You can flatten the columns as well:

df.columns = df.columns.map(''.join)
df = df.rename_axis(None)  # rename_axis returns a new DataFrame, so reassign

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
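
If you prefer a separator between the column name and the statistic (e.g. X_min rather than Xmin), you can join with an underscore instead:

df.columns = df.columns.map('_'.join)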
Peyton answered 13/11, 2021 at 0:31 Comment(0)

from itertools import product

aggs = {f"{col}{fn}": (col, fn) for col, fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
df.groupby('D').agg(**aggs)

Output:
   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
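
For clarity, the dict comprehension builds pandas' named-aggregation mapping, where each key is an output column and each value is a (source column, function) tuple:

aggs = {
    'Xmin': ('X', 'min'), 'Xmax': ('X', 'max'),
    'Ymin': ('Y', 'min'), 'Ymax': ('Y', 'max'),
    'Zmin': ('Z', 'min'), 'Zmax': ('Z', 'max'),
}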
Dentelle answered 13/11, 2021 at 0:30 Comment(3)
This is the fastest, per HarryPlotter's benchmarks.Peyton
@user17242583: Not per cycle. (Though it is not clear why the two kinds of numbers would not be consistent - due to some fixed overhead?)Dagger
Yeah actually, for an individual run it looks like mine (and Harry's) was faster.Peyton

df = df.groupby('D').agg(['min', 'max'])
df.columns = df.columns.map("".join)  # flatten multi-level columns

Output

>>> df

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D                                    
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14

Speed Test

I did some simple tests comparing Celius's, Ashish's, and user17242583's/my solutions.

Setup

import numpy as np 
import pandas as pd 
from itertools import product

n = 1_000_000
rng = np.random.default_rng() 

df = pd.DataFrame({
    'D': rng.choice(list("ABCDEFGH"), size=n),
    'X': rng.integers(100, size=n),
    'Y': rng.integers(100, size=n),
    'Z': rng.integers(100, size=n),
})

>>> df.shape
(1000000, 4)

Results

Here are the results.

# Ashish's 
>>> %%timeit -n 50
... aggs = {f"{col}{fn}": (col, fn) for col,fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
... df1 = df.groupby('D').agg(**aggs) 

116 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# user17242583's / mine 
>>> %%timeit -n 50
... df1 = df.groupby('D').agg(['min', 'max'])
... df1.columns = df1.columns.map("".join)  # flatten multi-level columns

120 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# Celius's
>>> %%timeit -n 50
... df1 = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

178 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

Comparing the first two with the last one, we can conclude that using DataFrameGroupBy.agg is a little faster for large DataFrames. This makes sense: the join approach runs groupby twice and then merges the two results, while agg computes both statistics from a single groupby.

Idel answered 13/11, 2021 at 1:2 Comment(7)
@Peyton Yes, but only for >~1M rows. For ~100k rows, the timings are practically identical ;) Below that, Celius's answer seems faster!Idel
I would not have expected my answer to be the fastest in any scenario (it is a join operation with two groupings inside it); however, for <100k rows, speed is usually deprioritized in favor of readability and approachability for a newer user. As long as bad habits are not taught (like using iterrows when a vectorized solution is available), I aim for a one-liner that does the trick.Belda
@CeliusStingher yeah I understood, and I totally agree with you! I was just curious how big was the difference between using join and agg.Idel
Re "I did some simple tests": On what system (hardware (CPU type, clock frequency, number of cores, etc.), operating system), incl. version? Using what software and which versions/compiler flags (e.g., hardware acceleration)/compiler configuration (e.g. for Python and Pandas)? Under what conditions, e.g., cold (first time run) / warm. Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).Dagger
Related: What makes a "good" performance question on Stack Overflow?Dagger
I think the answer is good enough for general knowledge. I am curious about something: while specifying all the details you mention might help create a more objective measurement, there is probably no way we can match the OP's specs, and most users have different specs too. Therefore all our measurements will differ from the OP's, but the ranking of the most efficient vs the least efficient solution will likely hold across most setups.Belda
@CeliusStingher Thank you, I completely agree. During many years of reading SO posts, I never saw someone providing those details in a simple performance test. The average user isn’t interested in them. The setup is precisely given so everyone (especially OP) can easily compare the timings in their own machine if they want. It shouldn’t be seen as a rigorous benchmark, as many more tests would be needed. The point was just to show that for large enough data, using agg instead of join is faster due to their implementation, which is something we're all aware of.Idel
