Get min and max values of a categorical variable in a dataframe
I have a dataframe that looks like this:

D   X   Y   Z
A  22  16  23
A  21  16  22
A  20  17  21
B  33  50  11
B  34  53  12
B  34  55  13
C  44  34  11
C  45  33  11
C  45  33  10
D  55  35  60
D  57  34  61
E  66  36  13
E  67  38  14 
E  67  37  13

I want to get the minimum and maximum of each column for every level of the categorical variable D, so the output dataframe should look something like this:

D Xmin Xmax Ymin Ymax Zmin Zmax
A 20    22   16   17   21   23 
B 33    34   50   55   11   13    
C 44    45   33   34   10   11  
D 55    57   34   35   60   61
E 66    67   36   38   13   14

I have tried this, but no luck:

min_max_df = dfObj.groupby('D').agg({'X': [dfObj.min(axis=0), dfObj.max(axis=0)]})
Mob answered 13/11, 2021 at 0:22 Comment(0)

I believe this is a nice way of doing it in a single line of code: join combines the two grouped results by index, and lsuffix/rsuffix differentiate the min and max columns.

output = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

Outputs:

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
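
Note that D ends up as the index here. If you also want it back as a regular column, matching the desired output, you can reset the index:

output = output.reset_index()  # move D from the index back to a column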
Belda answered 13/11, 2021 at 0:39 Comment(0)

df = df.groupby('D').agg(['min', 'max'])

Output:

>>> df
    X       Y       Z    
  min max min max min max
D                        
A  20  22  16  17  21  23
B  33  34  50  55  11  13
C  44  45  33  34  10  11
D  55  57  34  35  60  61
E  66  67  36  38  13  14

>>> df['X']['min']
D
A    20
B    33
C    44
D    55
E    66
Name: min, dtype: int64
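
Equivalently, you can select a MultiIndex column with a single tuple key instead of chained indexing:

>>> df[('X', 'min')]  # same Series as df['X']['min'], in one lookup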

You can flatten the columns as well:

df.columns = df.columns.map(''.join)
df = df.rename_axis(None)  # rename_axis returns a new DataFrame, so reassign

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
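
If you prefer a separator between the column name and the statistic (e.g. X_min rather than Xmin), you can join with an underscore instead:

df.columns = df.columns.map('_'.join)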
Peyton answered 13/11, 2021 at 0:31 Comment(0)

from itertools import product

aggs = {f"{col}{fn}": (col, fn) for col, fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
df.groupby('D').agg(**aggs)

Output:
   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
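
For clarity, the dict comprehension builds pandas' named-aggregation mapping, where each key is an output column and each value is a (source column, function) tuple:

aggs = {
    'Xmin': ('X', 'min'), 'Xmax': ('X', 'max'),
    'Ymin': ('Y', 'min'), 'Ymax': ('Y', 'max'),
    'Zmin': ('Z', 'min'), 'Zmax': ('Z', 'max'),
}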
Dentelle answered 13/11, 2021 at 0:30 Comment(3)
This is the fastest, per HarryPlotter's benchmarks.Peyton
@user17242583: Not per cycle. (Though it is not clear why the two kinds of numbers would not be consistent - due to some fixed overhead?)Dagger
Yeah actually, for an individual run it looks like mine (and Harry's) was faster.Peyton

df = df.groupby('D').agg(['min', 'max'])
df.columns = df.columns.map("".join)  # flatten multi-level columns

Output

>>> df

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D                                    
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14

Speed Test

I did some simple tests comparing Celius's, Ashish's, and user17242583's/my solutions.

Setup

import numpy as np 
import pandas as pd 
from itertools import product

n = 1_000_000
rng = np.random.default_rng() 

df = pd.DataFrame({
    'D': rng.choice(list("ABCDEFGH"), size=n),
    'X': rng.integers(100, size=n),
    'Y': rng.integers(100, size=n),
    'Z': rng.integers(100, size=n),
})

>>> df.shape
(1000000, 4)

Results

Here are the results.

# Ashish's 
>>> %%timeit -n 50
... aggs = {f"{col}{fn}": (col, fn) for col,fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
... df1 = df.groupby('D').agg(**aggs) 

116 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# user17242583's / mine 
>>> %%timeit -n 50
... df1 = df.groupby('D').agg(['min', 'max'])
... df1.columns = df1.columns.map("".join)  # flatten multi-level columns

120 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# Celius's
>>> %%timeit -n 50
... df1 = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

178 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

Comparing the first two with the last one, we can conclude that using DataFrameGroupBy.agg is a little faster for large DataFrames. This makes sense: the join approach runs groupby twice and then merges the two results, while agg computes both statistics from a single groupby.

Idel answered 13/11, 2021 at 1:2 Comment(7)
@Peyton Yes, but only for >~1M rows. For ~100k rows, the timings are practically identical ;) Below that, Celius's answer seems faster!Idel
I would not have expected my answer to be the fastest in any scenario (it is a join operation with two groupings inside it); however, for <100k rows, speed is usually deprioritized in favor of readability and approachability for a newer user. As long as bad habits are not taught (like using iterrows when a vectorized solution is available), I aim for a one-liner that does the trick.Belda
@CeliusStingher yeah I understood, and I totally agree with you! I was just curious how big was the difference between using join and agg.Idel
Re "I did some simple tests": On what system (hardware (CPU type, clock frequency, number of cores, etc.), operating system), incl. version? Using what software and which versions/compiler flags (e.g., hardware acceleration)/compiler configuration (e.g. for Python and Pandas)? Under what conditions, e.g., cold (first time run) / warm. Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).Dagger
Related: What makes a "good" performance question on Stack Overflow?Dagger
I think the answer is good enough for general knowledge. I am curious about something: while specifying all the details you mention might help create a more objective measurement, there is probably no way we can match the OP's specs, and most users have different specs too. Therefore all our measurements will differ from the OP's, but the ranking of the most efficient vs the least efficient solution will likely hold across most setups.Belda
@CeliusStingher Thank you, I completely agree. During many years of reading SO posts, I never saw someone providing those details in a simple performance test. The average user isn’t interested in them. The setup is precisely given so everyone (especially OP) can easily compare the timings in their own machine if they want. It shouldn’t be seen as a rigorous benchmark, as many more tests would be needed. The point was just to show that for large enough data, using agg instead of join is faster due to their implementation, which is something we're all aware of.Idel
