Logical operators for Boolean indexing in Pandas

Asked 28/1, 2014 at 20:4 Answered 29/10, 2022 at 11:46

Solved python pandas dataframe boolean filtering

298

I'm working with a Boolean index in Pandas.

The question is why the statement:

a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]

works fine whereas

a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]

exits with an error?

Example:

a = pd.DataFrame({'x':[1,1],'y':[10,20]})

In: a[(a['x']==1)&(a['y']==10)]
Out:    x   y
     0  1  10

In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Woodenhead asked 28/1, 2014 at 20:4 Comment(3)

This is because numpy arrays and pandas series use the bitwise operators rather than logical as you are comparing every element in the array/series with another. It therefore does not make sense to use the logical operator in this situation. see related: #8632533 – Halifax 28/1, 2014 at 20:15

In Python and != &. The and operator in Python cannot be overridden, whereas the & operator (__and__) can. Hence the choice to use & in numpy and pandas. – Kythera 28/1, 2014 at 20:19

Related: Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() – Unclassical 8/3, 2019 at 22:10

342

When you say

(a['x']==1) and (a['y']==10)

You are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to Boolean values.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a Boolean value -- in other words, they raise

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

when used as a Boolean value. That's because it's unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might desire for it to be True only if all its elements are True. Others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.

Instead, you must be explicit, by checking the empty attribute or calling the all() or any() method to indicate which behavior you desire.
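As a small sketch of what being explicit can look like (the Series s below is made up for illustration and is not taken from the question):

```python
import pandas as pd

# Hypothetical mask, used only for illustration.
s = pd.Series([True, False, True])

has_any = s.any()    # True if at least one element is True
has_all = s.all()    # True only if every element is True
is_empty = s.empty   # an attribute, not a method: True if there are no elements

# bool(s) is ambiguous, so pandas raises instead of guessing.
try:
    bool(s)
    raised = False
except ValueError:
    raised = True

print(has_any, has_all, is_empty, raised)
```

Each of the three reductions answers a different question, which is why no single one is chosen implicitly.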

In this case, however, it looks like you do not want Boolean evaluation; you want element-wise logical-and. That is what the & binary operator performs:

(a['x']==1) & (a['y']==10)

returns a boolean array.

By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==.

Without the parentheses,

a['x']==1 & a['y']==10

would be evaluated as

a['x'] == (1 & a['y']) == 10

which would in turn be equivalent to the chained comparison

(a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10)

That is an expression of the form Series and Series. The use of and with two Series would again trigger the same ValueError as above. That's why the parentheses are mandatory.
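The difference can be checked directly on the DataFrame from the question. This is only a sketch; the variable names mask and error are chosen here for illustration:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 1], 'y': [10, 20]})

# Parenthesized: each comparison yields a boolean Series, then & combines them.
mask = (a['x'] == 1) & (a['y'] == 10)
print(mask.tolist())  # [True, False]

# Unparenthesized: parsed as a chained comparison, which needs bool(Series).
try:
    a['x'] == 1 & a['y'] == 10
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)  # ValueError
```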

Tractable answered 28/1, 2014 at 20:22 Comment(6)

numpy arrays do have this property if they are length one. Only pandas devs (stubbornly) refuse to guess :p – Cassius 28/1, 2014 at 20:37

Doesn't '&' carry the same ambiguous curve as 'and'? How come when it comes to '&', suddenly all users all agree it should be element-wise, while when they see 'and', their expectations vary? – Sphery 15/4, 2016 at 21:17

@Indominus: The Python language itself requires that the expression x and y triggers the evaluation of bool(x) and bool(y). Python "first evaluates x; if x is false, its value is returned; otherwise, y is evaluated and the resulting value is returned." So the syntax x and y cannot be used for element-wise logical-and, since only x or y can be returned. In contrast, x & y triggers x.__and__(y), and the __and__ method can be defined to return anything we like. – Tractable 15/4, 2016 at 22:58

Important to note: the parentheses around the == clause are mandatory. a['x']==1 & a['y']==10 returns the same error as in the question. – Followthrough 18/7, 2017 at 18:41

What is " | " for? – Elva 24/1, 2018 at 9:2

@Elva | is the bitwise or operator. Python operator docs found here. – Norri 25/1, 2019 at 22:36

230

TLDR: Logical operators in Pandas are &, | and ~, and parentheses (...) are important!

Python's and, or and not logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve a vectorized (element-wise) version of this functionality.

So the following in Python (where exp1 and exp2 are expressions which evaluate to a boolean result)...

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

...will translate to...

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

for pandas.

If in the process of performing a logical operation you get a ValueError, then you need to use parentheses for grouping:

(exp1) op (exp2)

For example,

(df['col1'] == x) & (df['col2'] == y)

And so on.
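To see why the bitwise operators can be repurposed while and, or and not cannot, here is a minimal sketch with a toy class. Flags is hypothetical and only mimics the idea; it is not how pandas is implemented:

```python
class Flags:
    """Hypothetical toy container, only to illustrate operator overloading."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):
        # `x & y` calls x.__and__(y), so it can return a new element-wise result.
        return Flags(p and q for p, q in zip(self.values, other.values))

    def __or__(self, other):
        return Flags(p or q for p, q in zip(self.values, other.values))

    def __invert__(self):
        return Flags(not p for p in self.values)

    def __bool__(self):
        # `x and y` calls bool(x) first; there is no hook to make it element-wise.
        raise ValueError("truth value is ambiguous")


x = Flags([True, True, False])
y = Flags([True, False, False])

print((x & y).values)  # [True, False, False]
print((x | y).values)  # [True, True, False]
print((~x).values)     # [False, False, True]

try:
    x and y
    and_raised = False
except ValueError:
    and_raised = True

print(and_raised)  # True
```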

Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

Consider the following setup:

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

Logical AND

For df above, say you'd like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.

Overloaded Bitwise `&` Operator

Before continuing, please take note of this particular excerpt of the docs, which states:

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

So, with this in mind, element-wise logical AND can be implemented with the bitwise operator &:

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

And the subsequent filtering step is simply,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

The parentheses are used to override the default operator precedence, where bitwise operators have higher precedence than the comparison operators (< and >).

If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

df['A'] < 5 & df['B'] > 5

It is parsed as

df['A'] < (5 & df['B']) > 5

Which becomes,

df['A'] < something_you_dont_want > 5

Which becomes (see the python docs on chained operator comparison),

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

Which becomes,

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

Which throws

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So, don't make that mistake! (I know I'm harping on this point, but please bear with me. This is a very, very common beginner's mistake, and must be explained very thoroughly.)
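If you want to confirm the parse yourself, the standard library's ast module shows it without needing a DataFrame at all. A small sketch, assuming Python 3.9 or later for ast.unparse:

```python
import ast

tree = ast.parse("df['A'] < 5 & df['B'] > 5", mode="eval")
node = tree.body

# The top-level node is a single chained comparison, not a bitwise AND.
print(type(node).__name__)                     # Compare
print([type(op).__name__ for op in node.ops])  # ['Lt', 'Gt']

# The middle operand is the unintended `5 & df['B']`.
middle = node.comparators[0]
print(ast.unparse(middle))
```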

Avoiding Parentheses Grouping

The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:

df['A'].lt(5)
 
0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

See the section on Flexible Comparisons. To summarise, we have

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛
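As a quick sanity check of the table, the sketch below compares each operator with its method counterpart. The DataFrame is rebuilt by hand from the values printed earlier, and the threshold 4 is arbitrary:

```python
import operator

import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

pairs = [
    (operator.gt, 'gt'), (operator.ge, 'ge'),
    (operator.lt, 'lt'), (operator.le, 'le'),
    (operator.eq, 'eq'), (operator.ne, 'ne'),
]

# Each operator and its method should give identical boolean Series.
results = {}
for op, name in pairs:
    via_operator = op(df['A'], 4)
    via_method = getattr(df['A'], name)(4)
    results[name] = via_operator.equals(via_method)

print(results)

# With methods, no grouping parentheses are needed around each condition.
mask = df['A'].lt(5) & df['B'].gt(5)
print(mask.tolist())  # [False, True, False, True, False]
```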

Another option for avoiding parentheses is to use DataFrame.query (or eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6

I have extensively documented query and eval in Dynamically evaluate an expression from a formula in Pandas.
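A short sketch comparing query with the mask-based version. The DataFrame is rebuilt by hand from the values shown above, and threshold is a hypothetical local variable, referenced inside the query string with the @ prefix:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

threshold = 5  # hypothetical local variable

by_mask = df[(df['A'] < 5) & (df['B'] > 5)]
by_query = df.query('A < 5 and B > 5')
by_query_var = df.query('A < @threshold and B > @threshold')

print(by_query.index.tolist())  # [1, 3]
print(by_query.equals(by_mask), by_query_var.equals(by_mask))  # True True
```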

`operator.and_`

Allows you to perform this operation in a functional manner. Internally calls Series.__and__ which corresponds to the bitwise operator.

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

You won't usually need this, but it is useful to know.

Generalizing: `np.logical_and` (and `logical_and.reduce`)

Another alternative is using np.logical_and, which also does not need parentheses grouping:

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

np.logical_and is a ufunc (universal function), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1 and m2 and m3 with &, you would have to do

m1 & m2 & m3

However, an easier option is

np.logical_and.reduce([m1, m2, m3])

This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and ANDing all of them):

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6
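Note that np.logical_and.reduce returns a plain NumPy array. If you would rather keep a pandas Series (with its index), one possible variant is functools.reduce with operator.and_. This is a sketch; the conditions list, including the third condition on C, is made up for illustration:

```python
from functools import reduce
import operator

import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

# Hypothetical (column, comparison, value) triples.
conditions = [('A', operator.lt, 5), ('B', operator.gt, 5), ('C', operator.ne, 9)]

masks = [op(df[col], value) for col, op, value in conditions]

# Fold the list with &, i.e. masks[0] & masks[1] & masks[2].
combined = reduce(operator.and_, masks)

print(type(combined).__name__)  # Series
print(combined.tolist())        # [False, False, False, True, False]
print(df[combined].index.tolist())  # [3]
```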

Logical OR

For the df above, say you'd like to return all rows where A == 3 or B == 7.

Overloaded Bitwise `|`

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7
 
0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

If you haven't yet, please also read the section on Logical AND above; all caveats apply here.

Alternatively, this operation can be specified with

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

`operator.or_`

Calls Series.__or__ under the hood.

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

`np.logical_or`

For two conditions, use logical_or:

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

For multiple masks, use logical_or.reduce:

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

Logical NOT

Given a mask, such as

mask = pd.Series([True, True, False])

If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

Bitwise `~`

~mask

0    False
1    False
2     True
dtype: bool

Again, expressions need to be parenthesised.

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

This internally calls mask.__invert__(), but don't use it directly.

`operator.inv`

Internally calls __invert__ on the Series.

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

`np.logical_not`

This is the numpy variant.

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool

Note that, for boolean masks, np.bitwise_and can be used in place of np.logical_and, np.bitwise_or in place of np.logical_or, and np.invert in place of np.logical_not.
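For completeness, here is a sketch of using NOT for filtering, plus what happens when the parentheses are forgotten. Unlike the AND case, the unparenthesized form does not raise here; it silently computes something else. The DataFrame is rebuilt by hand from the values shown earlier:

```python
import pandas as pd

# Same values as the df shown above, written out explicitly.
df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

# Rows where A is not 3.
not_three = df[~(df['A'] == 3)]
print(not_three.index.tolist())  # [0, 3, 4]

# The same filter via the ne method, without parentheses.
same = df[df['A'].ne(3)]

# Without parentheses, ~ binds first: this is (~df['A']) == 3, which
# bit-inverts the integers (5 -> -6, 3 -> -4, ...) before comparing.
wrong = ~df['A'] == 3
print(wrong.tolist())  # [False, False, False, False, False]
```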

Unclassical answered 25/1, 2019 at 2:53 Comment(13)

@ cs95 in the TLDR, for element-wise boolean OR, you advocate using |, which is equivalent to numpy.bitwise_or, instead of numpy.logical_or. May I ask why? Isn't numpy.logical_or designed for this task specifically? Why add the burden of doing it bitwise for each pair of elements? – Incarnadine 13/6, 2019 at 21:40

@Incarnadine can you quote the relevant text please? I cannot find what you're referring to. FWIW I maintain that logical_* is the correct functional equivalent of the operators. – Unclassical 13/6, 2019 at 21:50

@ cs95 I am referring to the first line of the Answer: "TLDR; Logical Operators in Pandas are &, | and ~". – Incarnadine 13/6, 2019 at 21:59

@Incarnadine It is literally in the documentation: "Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not." – Unclassical 13/6, 2019 at 22:5

@ cs95, ok, I just read this section, and it does use | for element-wise boolean operation. But to me, that documentation is more of a "tutorial", and in contrast, I feel these API references are closer to the source of truth: numpy.bitwise_or and numpy.logical_or - so I'm trying to make sense of what is described here. – Incarnadine 14/6, 2019 at 0:11

What's unclear to me is this: In the first numpy doc, it is mentioned numpy.bitwise_or is equivalent to |. But they don't say numpy.bitwise_or is functionally equivalent to numpy.logical_or. So how can we be sure they are? The former is a bitwise operation so doesn't it depend on NumPy's binary representation of the Boolean values? – Incarnadine 14/6, 2019 at 0:11

@Incarnadine the main difference between logical and bitwise operations is the short circuiting property (bitwise operators do not short circuit). So in that respect, they are not equivalent. However they do produce the same output for boolean masks. – Unclassical 14/6, 2019 at 1:33

@ cs95 but a boolean value (one element in the boolean mask) is not encoded/represented as 1 bit, or is it? If it's not 1 bit, then a bitwise operation may produce different output than a logical operation. – Incarnadine 14/6, 2019 at 19:5

@Incarnadine A bool is represented by an 8 bit number but uses only 1 bit. This is well understood. – Unclassical 14/6, 2019 at 19:9

@ cs95 And the bitwise boolean operations only operate on the 1 bit of the 8? Would you have a reference on this fact? – Incarnadine 15/6, 2019 at 3:2

For example, doing a bitwise not operation would normally invert all the bits, including the other 7. – Incarnadine 15/6, 2019 at 3:17

@Incarnadine it won't if the array is dtype bool. Otherwise, you're right. Try it out: np.bitwise_not([False]) versus np.bitwise_not(np.array([False], dtype=object)) – Unclassical 15/6, 2019 at 3:50

@ cs95 that's interesting. I also tested with bitwise_xor, and it seems these bitwise operators do not blindly work on all bits - they check the type, as you mentioned; if it's np.bool_, it's "smart" enough to know to operate only on the meaningful bit. So back to my original point: I now see for Boolean element-wise operations, | and numpy.bitwise_or are equivalent to numpy.logical_or, and | is probably preferred due to succinctness. – Incarnadine 15/6, 2019 at 20:7

Logical operators for boolean indexing in Pandas

It's important to realize that you cannot use any of the Python logical operators (and, or, or not) on pandas.Series or pandas.DataFrame objects (similarly, you cannot use them on numpy.array objects with more than one element). They implicitly call bool on their operands, which raises an exception because these data structures treat the Boolean value of an array as ambiguous:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I did cover this more extensively in my answer to the "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" Q+A.

NumPy’s logical functions

However, NumPy provides element-wise equivalents to these operators as functions that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass:

and has np.logical_and
or has np.logical_or
not has np.logical_not
numpy.logical_xor which has no Python equivalent, but it is a logical "exclusive or" operation

So, essentially, one should use (assuming df1 and df2 are Pandas DataFrames):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)
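A minimal sketch of these four calls on two small boolean DataFrames. The frames df1 and df2 and their single column p are made up for illustration, and the results are converted with np.asarray only to make them easy to print and compare:

```python
import numpy as np
import pandas as pd

# Hypothetical boolean frames covering all four input combinations.
df1 = pd.DataFrame({'p': [True, True, False, False]})
df2 = pd.DataFrame({'p': [True, False, True, False]})

both = np.logical_and(df1, df2)
either = np.logical_or(df1, df2)
negated = np.logical_not(df1)
exactly_one = np.logical_xor(df1, df2)

print(np.asarray(both).ravel().tolist())         # [True, False, False, False]
print(np.asarray(either).ravel().tolist())       # [True, True, True, False]
print(np.asarray(negated).ravel().tolist())      # [False, False, True, True]
print(np.asarray(exactly_one).ravel().tolist())  # [False, True, True, False]
```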

Bitwise functions and bitwise operators for Booleans

However, if you have a Boolean NumPy array, Pandas Series, or Pandas DataFrame, you could also use the element-wise bitwise functions (for Booleans they are - or at least should be - indistinguishable from the logical functions):

bitwise and: np.bitwise_and or the & operator
bitwise or: np.bitwise_or or the | operator
bitwise not: np.invert (or the alias np.bitwise_not) or the ~ operator
bitwise xor: np.bitwise_xor or the ^ operator

Typically the operators are used. However, when combined with comparison operators, one has to remember to wrap each comparison in parentheses because the bitwise operators have a higher precedence than the comparison operators:

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

This may be irritating because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are for example simple integers) and don't need the parentheses.

Differences between logical and bitwise operations (on non-booleans)

It is really important to stress that bit and logical operations are only equivalent for Boolean NumPy arrays (and boolean Series & DataFrames). If these don't contain Booleans then the operations will give different results. I'll include examples using NumPy arrays, but the results will be similar for the pandas data structures:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

And since NumPy (and similarly Pandas) does different things for Boolean (Boolean or “mask” index arrays) and integer (Index arrays) indices, the results of indexing will also be different:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])
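The gap becomes more visible once the integers go beyond 0 and 1. The following sketch uses two made-up integer Series, s1 and s2, and also shows one way to get logical behavior with & by comparing to zero first:

```python
import numpy as np
import pandas as pd

# Hypothetical non-boolean data.
s1 = pd.Series([1, 2, 4])
s2 = pd.Series([2, 2, 3])

# Logical AND: every value is non-zero, so every element is True.
logical = np.logical_and(s1, s2)
print(np.asarray(logical).tolist())  # [True, True, True]

# Bitwise AND works on the binary digits: 1 & 2 = 0, 2 & 2 = 2, 4 & 3 = 0.
bitwise = s1 & s2
print(bitwise.tolist())  # [0, 2, 0]

# Converting to boolean masks first makes & behave logically again.
explicit = (s1 != 0) & (s2 != 0)
print(explicit.tolist())  # [True, True, True]
```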

Summary table

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

The logical operators do not work for NumPy arrays, Pandas Series, and Pandas DataFrames. The others work on these data structures (and plain Python objects) and work element-wise. However, be careful with the bitwise invert on plain Python bools, because a bool is interpreted as an integer in this context (for example, ~False returns -1 and ~True returns -2).
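The pitfall with ~ on plain Python bools can be sketched as follows. The variable flag is made up for illustration, and the warnings filter is there because newer Python versions emit a DeprecationWarning for ~ on a bool:

```python
import warnings

import numpy as np

flag = True

with warnings.catch_warnings():
    # Newer Python versions warn about ~ on a plain bool; silence that here.
    warnings.simplefilter("ignore")
    inverted_int = ~flag  # bool is treated as the integer 1, so this is -2

logical = not flag                      # plain Python negation: False
numpy_scalar = ~np.bool_(flag)          # NumPy booleans invert logically
numpy_array = ~np.array([True, False])  # element-wise logical inversion

print(inverted_int, bool(inverted_int))  # -2 True
print(logical, bool(numpy_scalar), numpy_array.tolist())  # False False [False, True]
```

Because -2 is truthy, a check such as if ~flag: passes for both True and False, which is rarely what was intended.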

Bim answered 25/1, 2019 at 21:48 Comment(0)

Note that you can also use * to do and:

   In [12]: np.all([a > 20, a < 40], axis=0)
   Out[12]:
   array([[False,  True, False, False,  True],
          [False, False, False, False, False],
          [ True,  True, False, False, False],
          [False,  True, False, False, False],
          [False,  True, False, False, False]])

   In [13]: (a > 20) * (a < 40)
   Out[13]:
   array([[False,  True, False, False,  True],
          [False, False, False, False, False],
          [ True,  True, False, False, False],
          [False,  True, False, False, False],
          [False,  True, False, False, False]])

I'm not claiming this is better than using np.all or &. But it does work.
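Since the array a is not shown above, here is a self-contained sketch with a made-up array. It also checks the analogous use of + for OR. Note that the parentheses are still required, because * and + bind more tightly than the comparison operators:

```python
import numpy as np

# Hypothetical data; the original array `a` is not shown in this answer.
a = np.array([[10, 25, 50],
              [30, 39, 40]])

via_all = np.all([a > 20, a < 40], axis=0)
via_mul = (a > 20) * (a < 40)  # multiplication of boolean arrays acts as AND
via_and = (a > 20) & (a < 40)

via_add = (a <= 20) + (a >= 40)  # addition of boolean arrays acts as OR
via_or = (a <= 20) | (a >= 40)

print(via_mul.tolist())  # [[False, True, False], [True, True, False]]
print(np.array_equal(via_mul, via_and), np.array_equal(via_all, via_and))  # True True
print(np.array_equal(via_add, via_or))  # True
```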

Shantung answered 29/10, 2022 at 11:46 Comment(1)

Also + for or operation. In pandas series that's useful for avoiding parenthesis – Elli 22/11, 2022 at 15:54
