Pandas: get the min value between 2 dataframe columns

I have 2 columns and I want a 3rd column to be the minimum value between them. My data looks like this:

   A  B
0  2  1
1  2  1
2  2  4
3  2  4
4  3  5
5  3  5
6  3  6
7  3  6

And I want to get a column C in the following way:

   A  B   C
0  2  1   1
1  2  1   1
2  2  4   2
3  2  4   2
4  3  5   3
5  3  5   3
6  3  6   3
7  3  6   3

Some code to reproduce the data:

import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

Thanks!

Philoctetes answered 12/4, 2019 at 14:39 Comment(1)
Just for clarity, these would be min row values, not column values. - Louden

Use df.min(axis=1)

df['c'] = df.min(axis=1)
df
Out[41]: 
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3

Passing axis=1 makes min work row-wise, returning one minimum per row.
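If the frame holds more columns than the two being compared, one option is to select them before taking the row-wise minimum. A minimal sketch, reusing the data from the question; the extra `label` column is hypothetical and only there for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df['label'] = list('abcdefgh')  # hypothetical extra, non-numeric column

# Restrict the row-wise minimum to the two columns of interest
df['C'] = df[['A', 'B']].min(axis=1)
print(df)
```

Selecting the columns explicitly also keeps a previously added result column from being included in a later min call.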

For homogeneous dtypes and large dfs you can use numpy.min, which will be quicker:

In[42]:
import numpy as np

# .values hands np.min a plain numpy array; axis=1 reduces across columns
df['c'] = np.min(df.values, axis=1)
df

Out[42]: 
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3
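The numpy route relies on every column having a plain numeric dtype. A small sketch of guarding for that before choosing the fast path; the helper name `row_min` is hypothetical, not something from pandas or numpy:

```python
import numpy as np
import pandas as pd


def row_min(frame: pd.DataFrame) -> pd.Series:
    """Row-wise minimum, using numpy only when all columns are numeric numpy dtypes."""
    all_numeric = all(
        isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.number)
        for dtype in frame.dtypes
    )
    if all_numeric:
        # Homogeneous numeric data: reduce the underlying array directly
        return pd.Series(np.min(frame.to_numpy(), axis=1), index=frame.index)
    # Mixed or extension dtypes: fall back to the pandas implementation
    return frame.min(axis=1)


df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df['c'] = row_min(df[['A', 'B']])
```

Note that np.min propagates NaN, whereas df.min skips missing values by default, so the two can differ on data with gaps; np.nanmin is the NaN-skipping counterpart.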

Timings:

In[45]:
df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df]*1000, ignore_index=True)
df.shape

Out[45]: (8000, 2)

So for an 8K-row df:

%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
314 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
34.4 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You can see that the numpy version is roughly 9x quicker (314 µs vs 34.4 µs). Note that I pass df.values so that np.min receives a numpy array. The gap will matter more as dfs get even larger.
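When exactly two columns are involved, another option (not part of the answer above) is numpy.minimum, which compares two arrays element by element rather than reducing along an axis. A minimal sketch using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})

# Element-wise minimum of the two columns, computed on the raw arrays
df['C'] = np.minimum(df['A'].to_numpy(), df['B'].to_numpy())
print(df)
```

No timing is claimed for this variant; it would need to be measured on the data at hand.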

Note: for pandas 0.24.0 or greater, use to_numpy() instead of .values, so the above becomes:

df['c'] = np.min(df.to_numpy(),axis=1)

Timings:

%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
%timeit np.min(df.to_numpy(),axis=1)
314 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
35.2 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

There is a minor discrepancy between .values and to_numpy(). Whether it matters depends on whether you know upfront that the dtype is not mixed; the resulting dtype is also a factor, e.g. float16 vs float32 (see that link for further explanation). Pandas does a little more checking when calling to_numpy().
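To make the dtype point concrete, here is a small sketch with one float16 and one float32 column; the frame `mixed` and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two different float widths
mixed = pd.DataFrame({'x': np.array([1.5, 2.5], dtype='float16'),
                      'y': np.array([0.5, 3.5], dtype='float32')})

# to_numpy() picks a common dtype that can hold both columns
arr = mixed.to_numpy()
print(arr.dtype)

# The target dtype can also be requested explicitly
arr64 = mixed.to_numpy(dtype='float64')
row_min = np.min(arr64, axis=1)
print(row_min)
```

Converting mixed columns to a common dtype may copy and coerce the data, which is part of why the conversion step is not always free.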

Cassandry answered 12/4, 2019 at 14:41 Comment(2)
Perfect! Thank you for the solution and the numpy.min suggestion. That is what I will implement, as my df is large. - Philoctetes
Small note: with pandas 0.24.0 or higher, df.to_numpy() is preferred over df.values. - Errol
