TLDR: Logical operators in Pandas are &
, |
and ~
, and parentheses (...)
are important!
Python's and
, or
and not
logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve a vectorized (element-wise) version of this functionality.
So the following in Python (where exp1
and exp2
are expressions which evaluate to a boolean result)...
exp1 and exp2 # Logical AND
exp1 or exp2 # Logical OR
not exp1 # Logical NOT
...will translate to...
exp1 & exp2 # Element-wise logical AND
exp1 | exp2 # Element-wise logical OR
~exp1 # Element-wise logical NOT
for pandas.
If in the process of performing a logical operation you get a ValueError
, then you need to use parentheses for grouping:
(exp1) op (exp2)
For example,
(df['col1'] == x) & (df['col2'] == y)
And so on.
Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: &
for logical AND, |
for logical OR, and ~
for logical NOT.
Consider the following setup:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df
A B C
0 5 0 3
1 3 7 9
2 3 5 2
3 4 7 6
4 8 8 1
Logical AND
For df
above, say you'd like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.
Overloaded Bitwise &
Operator
Before continuing, please take note of this particular excerpt of the docs, which state
Another common operation is the use of boolean vectors to filter the
data. The operators are: |
for or
, &
for and
, and ~
for not
. These
must be grouped by using parentheses, since by default Python will
evaluate an expression such as df.A > 2 & df.B < 3
as df.A > (2 & df.B) < 3
, while the desired evaluation order is (df.A > 2) & (df.B < 3)
.
So, with this in mind, element-wise logical AND can be implemented with the bitwise operator &
:
df['A'] < 5
0 False
1 True
2 True
3 True
4 False
Name: A, dtype: bool
df['B'] > 5
0 False
1 True
2 False
3 True
4 True
Name: B, dtype: bool
(df['A'] < 5) & (df['B'] > 5)
0 False
1 True
2 False
3 True
4 False
dtype: bool
And the subsequent filtering step is simply,
df[(df['A'] < 5) & (df['B'] > 5)]
A B C
1 3 7 9
3 4 7 6
The parentheses are used to override the default operator precedence, where bitwise operators have higher precedence over the comparison operators (<
and >
).
If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as
df['A'] < 5 & df['B'] > 5
It is parsed as
df['A'] < (5 & df['B']) > 5
Which becomes,
df['A'] < something_you_dont_want > 5
Which becomes (see the python docs on chained operator comparison),
(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)
Which becomes,
# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2
Which throws
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So, don't make that mistake! (I know I'm harping on this point, but please bear with me. This is a very, very common beginner's mistake, and must be explained very thoroughly.)
Avoiding Parentheses Grouping
The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:
df['A'].lt(5)
0 True
1 True
2 True
3 True
4 False
Name: A, dtype: bool
df['B'].gt(5)
0 False
1 True
2 False
3 True
4 True
Name: B, dtype: bool
df['A'].lt(5) & df['B'].gt(5)
0 False
1 True
2 False
3 True
4 False
dtype: bool
See the section on Flexible Comparisons. To summarise, we have
╒════╤════════════╤════════════╕
│ │ Operator │ Function │
╞════╪════════════╪════════════╡
│ 0 │ > │ gt │
├────┼────────────┼────────────┤
│ 1 │ >= │ ge │
├────┼────────────┼────────────┤
│ 2 │ < │ lt │
├────┼────────────┼────────────┤
│ 3 │ <= │ le │
├────┼────────────┼────────────┤
│ 4 │ == │ eq │
├────┼────────────┼────────────┤
│ 5 │ != │ ne │
╘════╧════════════╧════════════╛
Another option for avoiding parentheses is to use DataFrame.query
(or eval
):
df.query('A < 5 and B > 5')
A B C
1 3 7 9
3 4 7 6
I have extensively documented query
and eval
in Dynamically evaluate an expression from a formula in Pandas.
Allows you to perform this operation in a functional manner. Internally calls Series.__and__
which corresponds to the bitwise operator.
import operator
operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5)
0 False
1 True
2 False
3 True
4 False
dtype: bool
df[operator.and_(df['A'] < 5, df['B'] > 5)]
A B C
1 3 7 9
3 4 7 6
You won't usually need this, but it is useful to know.
Generalizing: np.logical_and
(and logical_and.reduce
)
Another alternative is using np.logical_and
, which also does not need parentheses grouping:
np.logical_and(df['A'] < 5, df['B'] > 5)
0 False
1 True
2 False
3 True
4 False
Name: A, dtype: bool
df[np.logical_and(df['A'] < 5, df['B'] > 5)]
A B C
1 3 7 9
3 4 7 6
np.logical_and
is a ufunc (Universal Functions), and most ufuncs have a reduce
method. This means it is easier to generalise with logical_and
if you have multiple masks to AND. For example, to AND masks m1
and m2
and m3
with &
, you would have to do
m1 & m2 & m3
However, an easier option is
np.logical_and.reduce([m1, m2, m3])
This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and adding all of them):
import operator
cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]
m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m
# array([False, True, False, True, False])
df[m]
A B C
1 3 7 9
3 4 7 6
Logical OR
For the df
above, say you'd like to return all rows where A == 3 or B == 7.
Overloaded Bitwise |
df['A'] == 3
0 False
1 True
2 True
3 False
4 False
Name: A, dtype: bool
df['B'] == 7
0 False
1 True
2 False
3 True
4 False
Name: B, dtype: bool
(df['A'] == 3) | (df['B'] == 7)
0 False
1 True
2 True
3 True
4 False
dtype: bool
df[(df['A'] == 3) | (df['B'] == 7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
If you haven't yet, please also read the section on Logical AND above, all caveats apply here.
Alternatively, this operation can be specified with
df[df['A'].eq(3) | df['B'].eq(7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
Calls Series.__or__
under the hood.
operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)
0 False
1 True
2 True
3 True
4 False
dtype: bool
df[operator.or_(df['A'] == 3, df['B'] == 7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
For two conditions, use logical_or
:
np.logical_or(df['A'] == 3, df['B'] == 7)
0 False
1 True
2 True
3 True
4 False
Name: A, dtype: bool
df[np.logical_or(df['A'] == 3, df['B'] == 7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
For multiple masks, use logical_or.reduce
:
np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False, True, True, True, False])
df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
Logical NOT
Given a mask, such as
mask = pd.Series([True, True, False])
If you need to invert every boolean value (so that the end result is [False, False, True]
), then you can use any of the methods below.
Bitwise ~
~mask
0 False
1 False
2 True
dtype: bool
Again, expressions need to be parenthesised.
~(df['A'] == 3)
0 True
1 False
2 False
3 True
4 True
Name: A, dtype: bool
This internally calls mask.__invert__()
, but don't use it directly.
Internally calls __invert__
on the Series.
operator.inv(mask)
0 False
1 False
2 True
dtype: bool
This is the numpy variant.
np.logical_not(mask)
0 False
1 False
2 True
dtype: bool
Note, np.logical_and
can be substituted for np.bitwise_and
, logical_or
with bitwise_or
, and logical_not
with invert
.
and != &
. Theand
operator in Python cannot be overridden, whereas the&
operator (__and__
) can. Hence the choice the use&
in numpy and pandas. – Kythera