import numpy as np
a = np.array([0])
b = np.array([None])
c = np.array([''])
d = np.array([' '])
Why should we have this inconsistency:
>>> bool(a)
False
>>> bool(b)
False
>>> bool(c)
True
>>> bool(d)
False
For arrays with one element, the array's truth value is determined by the truth value of that element.
The main point to make is that np.array(['']) is not an array containing one empty Python string. This array is created to hold strings of exactly one byte each and NumPy pads strings that are too short with the null character. This means that the array is equal to np.array(['\0']).
In this regard, NumPy is being consistent with Python which evaluates bool('\0') as True.
In fact, the only strings which are False in NumPy arrays are strings which contain nothing but whitespace characters ('\0' is not a whitespace character).
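The plain-Python half of this claim can be checked without NumPy. The snippet below is only an illustration using built-in strings; the variable names are made up for the example.

```python
# Plain Python strings: only the truly empty string is falsy.
samples = ['', ' ', '\0', 'a']
truthiness = [bool(s) for s in samples]

# '\0' is a real character, and it does not count as whitespace.
null_is_space = '\0'.isspace()
space_is_space = ' '.isspace()

print(truthiness, null_is_space, space_is_space)
```

So a Python string holding a single null character is truthy, while a string holding a single space is whitespace-only, which is the distinction the NumPy check relies on.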
Details of this Boolean evaluation are presented below.
Navigating NumPy's labyrinthine source code is not always easy, but we can find the code governing how values in different datatypes are mapped to Boolean values in the arraytypes.c.src file. This will explain how bool(a), bool(b), bool(c) and bool(d) are determined.
Before we get to the code in that file, we can see that calling bool() on a NumPy array invokes the internal _array_nonzero() function. If the array is empty, we get False. If there are two or more elements we get an error. But if the array has exactly one element, we hit the line:
return PyArray_DESCR(mp)->f->nonzero(PyArray_DATA(mp), mp);
Now, PyArray_DESCR is a struct holding various properties for the array. f is a pointer to another struct PyArray_ArrFuncs that holds the array's nonzero function. In other words, NumPy is going to call upon the array's own special nonzero function to check the Boolean value of that one element.
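The dispatch just described can be modelled in a few lines of pure Python. This is a rough sketch, not NumPy's implementation: array_truth and its nonzero parameter are hypothetical names, and a plain list stands in for the array.

```python
def array_truth(elements, nonzero):
    """Model of the dispatch described above for a flat list of elements.

    `nonzero` plays the role of the dtype-specific nonzero function.
    """
    if len(elements) == 0:
        # Empty array: False, as described for _array_nonzero().
        return False
    if len(elements) == 1:
        # Exactly one element: defer to the type-specific check.
        return bool(nonzero(elements[0]))
    # Two or more elements: refuse to guess.
    raise ValueError("truth value of an array with more than one element is ambiguous")


print(array_truth([], bool))      # no elements
print(array_truth([0], bool))     # one element, checked with Python's bool
```

Passing a different nonzero callable changes the answer for the same element, which is why the result depends on the array's datatype.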
Determining whether an element is nonzero or not is obviously going to depend on the datatype of the element. The code implementing the type-specific nonzero functions can be found in the "nonzero" section of the arraytypes.c.src file.
As we'd expect, floats, integers and complex numbers are False if they're equal to zero. This explains bool(a). In the case of object arrays, None is similarly going to be evaluated as False because NumPy just calls the PyObject_IsTrue function. This explains bool(b).
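For reference, Python's own truth test (what PyObject_IsTrue performs at the C level, and what bool() exposes) gives the same answers for the values involved here. A small check with plain Python objects only:

```python
# Values that Python itself treats as false: numeric zeros and None.
zero_like = [0, 0.0, 0j, None]
results = [bool(v) for v in zero_like]

# Non-zero numbers are truthy regardless of sign or type.
nonzero_like = [1, -0.5, 1j]
nonzero_results = [bool(v) for v in nonzero_like]

print(results, nonzero_results)
```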
To understand the results of bool(c) and bool(d), we see that the nonzero function for string type arrays is mapped to the STRING_nonzero function:
static npy_bool
STRING_nonzero (char *ip, PyArrayObject *ap)
{
    int len = PyArray_DESCR(ap)->elsize; // size of dtype (not string length)
    int i;
    npy_bool nonz = NPY_FALSE;

    for (i = 0; i < len; i++) {
        if (!Py_STRING_ISSPACE(*ip)) { // if it isn't whitespace, it's True
            nonz = NPY_TRUE;
            break;
        }
        ip++;
    }
    return nonz;
}
(The unicode case is more or less the same idea.)
So in arrays with a string or unicode datatype, a string is only False if it contains only whitespace characters:
>>> bool(np.array([' ']))
False
In the case of array c in the question, there is really a null character \0 padding the seemingly-empty string:
>>> np.array(['']) == np.array(['\0'])
array([ True], dtype=bool)
The STRING_nonzero function sees this non-whitespace character and so bool(c) is True.
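To make the loop easier to follow, here is a pure-Python rendering of the same logic. It is a sketch under assumptions: string_nonzero is a hypothetical helper, bytes objects stand in for the raw buffer, and the null padding to elsize is simulated with ljust.

```python
def string_nonzero(value: bytes, elsize: int) -> bool:
    """Python model of STRING_nonzero: scan all `elsize` bytes of the slot."""
    # Simulate the fixed-width slot: short strings are padded with null bytes.
    padded = value.ljust(elsize, b"\0")[:elsize]
    for byte in padded:
        # bytes.isspace() accepts the same ASCII whitespace set as C's isspace().
        if not bytes([byte]).isspace():
            return True   # first non-whitespace byte decides the result
    return False


print(string_nonzero(b"", 1))    # slot holds b"\0", which is not whitespace
print(string_nonzero(b" ", 1))   # slot holds only a space
```

The loop runs over the size of the slot rather than the length of the string, so the padding byte is examined even for a seemingly empty string.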
As noted at the start of this answer, this is consistent with Python's evaluation of strings containing a single null character: bool('\0') is also True.
Update: Wim has fixed the behaviour detailed above in NumPy's master branch by making strings which contain only null characters, or a mix of only whitespace and null characters, evaluate to False. This means that NumPy 1.10+ will see that bool(np.array([''])) is False, which is much more in line with Python's treatment of "empty" strings.
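The revised rule can be modelled the same way. This is only a sketch of the behaviour as described in the update, not the actual patch: string_nonzero_fixed is a hypothetical name and bytes objects again stand in for the fixed-width slot.

```python
def string_nonzero_fixed(value: bytes, elsize: int) -> bool:
    """Model of the revised rule: null bytes and whitespace both count as empty."""
    # Simulate the fixed-width slot, padded with null bytes.
    padded = value.ljust(elsize, b"\0")[:elsize]
    # True only if some byte is neither a null byte nor ASCII whitespace.
    return any(byte != 0 and not bytes([byte]).isspace() for byte in padded)


print(string_nonzero_fixed(b"", 1))      # only padding
print(string_nonzero_fixed(b" \0", 2))   # mix of whitespace and null
print(string_nonzero_fixed(b"a", 1))     # real content
```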
np.array([' ']) is false.. hilarious! – Novena
len in this context is the value of elsize, which is the size of the datatype. Both np.array(['']) and np.array([' ']) are created with a dtype of '<U1' on my machine. That is, it appears elsize is greater than zero in both these cases, and the loop is still entered in the case of the empty string. – Udale
True. If every character is a whitespace character then nonz is not changed to True and the function returns False. – Udale
np.array(['\0']) == np.array(['']) evaluates to array([ True], dtype=bool) for me – Novena
np.array([' ']) doesn't contain one - it's just used as a placeholder character if empty strings are passed in. – Udale
x = ' \0' then you get np.array([x])[0] != x. In python you have len(x) == 2 but in a = np.array([x]) suddenly len(a[0]) == 1. – Novena
'' as true, I speculate that it's because Python treats bool('\0') as true. There is no such thing as a truly empty string in an array, so NumPy is just being consistent with Python here. – Udale
a == b and a == c then b == c, but here we have array(['']) == '' and array(['']) == '\0', with of course '' != '\0' – Novena
array(['a'], dtype='S2') == array(['a'], dtype='S5'). I guess your example highlights the need to be wary when comparing Python strings and string arrays because the transitivity of == may fail... – Udale
len(np.array([''])[0]) gives zero, so it's absurd to say that the string "isn't empty". The null byte in this "empty" string is not accessible in any way except for its awkward surfacing in this boolean context. It would make more sense for the null terminating byte to be totally ignored for all computations. – Monohydric
'S5', to have "length" less than 5. The reasonable way to do this is to have NumPy's string length function, np.char.str_len, ignore any trailing null characters. Python's built in len function doesn't do this for Python strings. Perhaps it would make more sense for the nonzero function for strings to be implemented in terms of np.char.str_len rather than whitespace. – Udale
I'm pretty sure the answer is, as explained in Scalars, that:
Array scalars have the same attributes and methods as ndarrays. [1] This allows one to treat items of an array partly on the same footing as arrays, smoothing out rough edges that result when mixing scalar and array operations.
So, if it's acceptable to call bool on a scalar, it must be acceptable to call bool on an array of shape (1,), because they are, as far as possible, the same thing.
And, while it isn't directly said anywhere in the docs that I know of, it's pretty obvious from the design that NumPy's scalars are supposed to act like native Python objects.
So, that explains why np.array([0]) is falsey rather than truthy, which is what you were initially surprised about.
So, that explains the basics. But what about the specifics of case c?
First, note that your array np.array(['']) is not an array of one Python object, but an array of one NumPy <U1 null-terminated character string of length 1. Fixed-length-string values don't have the same truthiness rule as Python strings, and they really couldn't; for a fixed-length-string type, "false if empty" doesn't make any sense, because they're never empty. You could argue about whether NumPy should have been designed that way or not, but it clearly does follow that rule consistently, and I don't think the opposite rule would be any less confusing here, just different.
But there seems to be something else weird going on with strings. Consider this:
>>> np.array(['a', 'b']) != 0
True
That's not doing an elementwise comparison of the <U2 strings to 0 and returning array([True, True]) (as you'd get from np.array(['a', 'b'], dtype=object)), it's doing an array-wide comparison and deciding that no array of strings is equal to 0, which seems odd… I'm not sure whether this deserves a separate answer here or even a whole separate question, but I am pretty sure I'm not going to be the one who writes that answer, because I have no clue what's going on here. :)
Beyond arrays of shape (1,), arrays of shape () are treated the same way, but anything else is a ValueError, because otherwise it would be very easy to misuse arrays with and and other Python operators that NumPy can't automagically convert into elementwise operations.
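The reason `and` cannot be made elementwise is that Python evaluates it by calling bool() on the left operand, and a class can only hook into that through `__bool__`. A toy illustration follows; StrictVector is a hypothetical class written for this example, not part of NumPy.

```python
class StrictVector:
    """Toy container that refuses an ambiguous truth value, like the rule above."""

    def __init__(self, items):
        self.items = list(items)

    def __bool__(self):
        if len(self.items) > 1:
            # Same spirit as the ValueError described above.
            raise ValueError("truth value of a multi-element vector is ambiguous")
        # Zero or one element: fall back to the element's own truthiness.
        return bool(self.items[0]) if self.items else False


single = StrictVector([0])
print(bool(single))          # one element, so its truthiness is used

try:
    StrictVector([1, 2]) and StrictVector([3, 4])   # `and` calls bool() on the left side
except ValueError as exc:
    print("refused:", exc)
```

Raising here forces the caller to say what they mean instead of silently getting one of the two operands back.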
I personally think being consistent with other arrays would be more useful than being consistent with scalars here; in other words, just raise a ValueError. I also think that, if being consistent with scalars were important here, it would be better to be consistent with the unboxed Python values. In other words, if bool(array([v])) and bool(array(v)) are going to be allowed at all, they should always return exactly the same thing as bool(v), even if that's not consistent with np.nonzero. But I can see the argument the other way.
'\0' is a perfectly valid non-empty length-1 string, not the same empty string as ''. – Shapiro
a != 0 and b != 0 produce array results, but c != 0 just produces the single value True. I think numpy is doing some kind of type-comparison trickery which causes it to use a different rule for comparing string arrays than for other types. – Monohydric
dtype('<U1') instead of np.str_. Which is something I never actually do, so I'd have to look it up in the docs… – Shapiro
c != 0. Maybe it's treating a string scalar as an array-like collection of its characters? – Shapiro
np.array(['a', 'a']) != 0 returns a single True value rather than array([True, True]) as well, so you're definitely on to something… I'm just not sure yet what it is. – Shapiro
np.nonzero(x) is (element-wise) true for numbers that aren't zero, and for everything that isn't a number. That's not quite the same as bool(x), which is (non-element-wise) true for numbers that aren't zero, and for non-empty collections, and for everything that isn't a number or collection. – Shapiro
np.nonzero() is actually element-wise false for None, which contradicts what you are describing, as far as I understand. In effect, I still don't understand NumPy's logic from your description (the last comment of your answer): it still seems strange, as highlighted in @wim's question. – Shellfish
None in a NumPy array is with dtype object; you can't put it in any of NumPy's own types. – Shapiro
() are always false" - are you sure? I just tried bool(numpy.array(3)) and got True. – Starch
nonzero() returns an empty array (no nonzero element) for None, [None] and array([None]), then: since None "isn't a number", I understand that you were saying that nonzero() is "true" for None, so it should indicate that it is non-zero, which is not the case (empty array returned by nonzero()). If you still don't see what I mean, we can stop here, no problem: the discussion is long already. :) – Shellfish
object dtype, NumPy just defers to Python on whether something is truthy. For NumPy's own native types, it has its own rules. Do I need to edit that into the answer? – Shapiro
c in the original post seems to imply. – Shellfish
== 0 or == '0' can also influence whether you get ndarray or bool output – Novena
a == 0 and c == '' both give you an ndarray. But a == '0' and c == 0 both give you a bool. However, that idea went out the window because b behaves the exact opposite. Sighs – Novena
It's fixed in master now.
I thought this was a bug, and the numpy devs agreed, so this patch was merged earlier today. We should see new behaviour in the upcoming 1.10 release.
NumPy seems to follow the same casting rules as built-in Python; in this context the result seems to depend on which elements return true for calls to nonzero. Apparently len can also be used, but here none of these arrays is empty (length 0), so that's not directly relevant. Note that calling bool([False]) also returns True according to these rules.
a = np.array([0])
b = np.array([None])
c = np.array([''])
>>> np.nonzero(a)
(array([], dtype=int64),)
>>> np.nonzero(b)
(array([], dtype=int64),)
>>> np.nonzero(c)
(array([0]),)
This also seems consistent with the more enumerative description of bool casting, where your examples are all explicitly discussed.
Interestingly, there does seem to be systematically different behavior with string arrays, e.g.
>>> a.astype(bool)
array([False], dtype=bool)
>>> b.astype(bool)
array([False], dtype=bool)
>>> c.astype(bool)
ERROR: ValueError: invalid literal for int() with base 10: ''
I think, when numpy converts something into a bool it uses the PyArray_BoolConverter function which, in turn, just calls the PyObject_IsTrue function, i.e. the exact same function that builtin python uses, which is why numpy's results are so consistent.
nonzero instead of Python's bool when the only point of letting (1,)-shape arrays respond to bool is to let them transparently act like scalars. – Shapiro
bool([False]) may evaluate to True, but so do bool([0]), bool([None]), and bool(['']). The question is, why does numpy treat the empty string differently from other falsey values in this context? – Rosol
bool()) indicates that __nonzero__() is used in this case (a little bit ambiguously, since __len__() could also be used). You are using numpy.nonzero() where you should be using __nonzero__(). – Shellfish
nonzero. There is a __nonzero__ magic method (in 2.x only…), but the function that calls it is bool. What you've done is imported a somewhat-related but not actually-related-to-this-problem NumPy function whose name is confusing you. – Shapiro
__nonzero__ and NumPy's nonzero, and it looks like a link that's relevant to the latter but is actually relevant to the former. –
Shapiro
ValueError as any other array. But if they are going to do this, they should probably actually act like scalars and let Python use its normal rules (so bool(self[0])). But maybe there's some good reason for this… – Shapiro
and and or operators) would mean more to keep in your head. – Shapiro
__nonzero__ handling. – Monohydric
__nonzero__ vs. Python 3's __bool__. It supports both in both versions, but in a clunky way that was initially broken in Python 3, and now is correct in both, but its clunkiness can still be exposed by trying to use the wrong language's magic method. – Shapiro