'is' operator behaves differently when comparing strings with spaces
F

6

31

I've started learning Python (python 3.3) and I was trying out the is operator. I tried this:

>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False
>>> c = 'isitthespace'
>>> d = 'isitthespace'
>>> c is d
True
>>> e = 'isitthespace?'
>>> f = 'isitthespace?'
>>> e is f
False

It seems like the space and the question mark make is behave differently. What's going on?

EDIT: I know I should be using ==, I just wanted to know why is behaves like this.

Faro answered 26/5, 2013 at 6:9 Comment(7)
For the record, you should be using == to compare any item for equality, but this is an interesting question nonetheless.Fluff
Probably some kind of string interning is causing a is b (noticing the string constant assigned to b has already been created and re-using it). The interning rule must care about spaces (or possibly length)Affirm
Hmm... I have different results while using file instead of writing in interpreter. The same in ideone.Midrib
For whatever reason id('ab') consistently returns the same value in my shell while id('a ') consistently changes. I still have no idea why letters would have different behavior, but it's interesting to observe. Perhaps Python makes some kind of optimization by assuming that strings will often contain letters? I don't think that would make much sense but it's hard to explain this behavior. This is an interesting question.Aftonag
I would still like to see a definitive answer to this regarding CPythonFluff
As you already know about what is really does, maybe this question would be helpful - if it contained a useful answer.Notwithstanding
read this https://mcmap.net/q/37596/-why-0-6-is-6-false-duplicatePatency
A
26

Warning: this answer is about the implementation details of a specific Python interpreter. Comparing strings with is is a bad idea.

Well, at least for CPython 3.4/2.7.3, the answer is "no, it is not the whitespace", or rather, not only the whitespace:

  • Two string literals will share memory if they are either alphanumeric or reside in the same block (file, function, class or single interpreter command)

  • An expression that evaluates to a string will result in an object that is identical to the one created using a string literal, if and only if it is created using constants and binary/unary operators, and the resulting string is shorter than 21 characters.

  • Single characters are unique.

Examples

Alphanumeric string literals always share memory:

>>> x='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> y='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> x is y
True

Non-alphanumeric string literals share memory if and only if they share the enclosing syntactic block:

(interpreter)

>>> x='`!@#$%^&*() \][=-. >:"?<a'; y='`!@#$%^&*() \][=-. >:"?<a';
>>> z='`!@#$%^&*() \][=-. >:"?<a';
>>> x is y
True 
>>> x is z
False 

(file)

x='`!@#$%^&*() \][=-. >:"?<a';
y='`!@#$%^&*() \][=-. >:"?<a';
z=(lambda : '`!@#$%^&*() \][=-. >:"?<a')()
print(x is y)
print(x is z)

Output: True and False

For simple binary operations, the compiler does very simple constant folding (see peephole.c), but with strings it does so only if the resulting string is shorter than 21 characters. If this is the case, the rules mentioned earlier are in force:

>>> 'a'*10+'a'*10 is 'a'*20
True
>>> 'a'*21 is 'a'*21
False
>>> 'aaaaaaaaaaaaaaaaaaaaa' is 'aaaaaaaa' + 'aaaaaaaaaaaaa'
False
>>> t=2; 'a'*t is 'aa'
False
>>> 'a'.__add__('a') is 'aa'
False
>>> x='a' ; x+='a'; x is 'aa'
False
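
One way to see which expressions the compiler folds is to look at the constants stored in the compiled code object. This is only a sketch: constants_of is a hypothetical helper name, and what ends up in co_consts depends on the interpreter and its version (the length limit on folded strings, for instance, is not the same across CPython releases), so no particular output is promised here.

```python
def constants_of(expression):
    """Compile an expression and return the constants stored in its code object."""
    return compile(expression, "<sketch>", "eval").co_consts


# Built only from constants and binary operators, so the compiler may
# replace the whole expression with one precomputed string.
folded = constants_of("'a' * 10 + 'a' * 10")
print(folded)

# A method call is evaluated at run time, so only the operand 'a'
# is stored as a constant, never the result 'aa'.
not_folded = constants_of("'a'.__add__('a')")
print(not_folded)
```

If the first tuple contains the 20-character string, the expression was folded at compile time; if it contains only 'a' and 10, the work is left for run time.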

Single characters always share memory, of course:

>>> chr(0x20) is ' '
True
Aloin answered 26/5, 2013 at 6:9 Comment(12)
Probably then the strings are not interned as I thought before, but taken - as string literals - from the same pool of strings inside a module.Notwithstanding
It is possible that when you hard-code 'a'*20 (or addition of string literals) the interpreter makes an optimizing decision and replaces it with the resulting string 'aaaaaaaaaaaaaaaaaaaa', so no string manipulation is done at run time (faster execution, but larger compiled code); but when the multiplication is too big the optimization does not activate, in order to keep the code size small.Anastigmatic
@lftah dis.dis(lambda: 'x'*20) results in LOAD_CONST 3 ('xxxxxxxxxxxxxxxxxxxx'), dis.dis(lambda: 'aaaaa' + "aaa") in LOAD_CONST 3 ('aaaaaaaa'). I could imagine that a method call (e.g. .join()) is "too complicated" in that sense.Notwithstanding
BTW, dis.dis(lambda: 'x'*21) leads to LOAD_CONST 1 ('x') LOAD_CONST 2 (21) BINARY_MULTIPLYNotwithstanding
Seems like the compiler is doing simple constant propagation, including '*' or '+' if the result is less than 21 characters.Aloin
What you show is only how this specific implementation behaves. In other words, it is only an implementation detail, an optimization, and it should not be considered a language feature. Think about the situation when the algorithm is to be distributed. The is operator should not be misused just because of features of the implementation.Escarpment
@pepr, we've been through this already. The implementation details are what is interesting here. We do not try to teach how to use python in the correct way right now.Aloin
@Elazar: I see. Then, isn't it better to look inside the implementation?Escarpment
@Escarpment you are more than welcome to look inside it and tell us. I found it easier for me to knock on the walls; but by all means - go ahead. Looking inside the implementation is indeed the right way to do it.Aloin
+1 great answer, In case someone is using Ipython shell then x='`!@#$%^&*() \][=-. >:"?<a'; y='`!@#$%^&*() \][=-. >:"?<a'; will return False.Flake
@Elazar: Great answer +1 :) Where did you find this information? In which source file is this implemented?Playful
Thanks. There's a link in the answer. But generally I just played with the REPL.Aloin
N
16

To expand on Ignacio’s answer a bit: The is operator is the identity operator. It is used to compare object identity. If you construct two objects with the same contents, then the identity comparison usually does not yield true. It works for some small strings because CPython, the reference implementation of Python, keeps a single shared copy of them, so all those names refer to the same string object. So the is operator returns true for those.

This, however, is an implementation detail of CPython and is guaranteed neither for CPython nor for any other implementation. So relying on it is a bad idea, as it can break any day.

To compare strings, you use the == operator, which compares the equality of objects. Two string objects are considered equal when they contain the same characters. So this is the correct operator to use when comparing strings, and is should generally be avoided if you do not explicitly want object identity (example: a is False).
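
The difference between the two operators can be sketched as follows. The helper name compare is made up for this illustration, and the strings are built at run time so that the compiler cannot merge them as constants. Whether the two objects turn out to be identical is up to the implementation, which is why the code makes no claim about that line.

```python
def compare(a, b):
    """Return a pair (equal, identical) for two objects."""
    return a == b, a is b


# Build both strings at run time from the same pieces.
parts = ["is it", "the space?"]
first = " ".join(parts)
second = " ".join(parts)

equal, identical = compare(first, second)
print(equal)      # True: both strings contain the same characters
print(identical)  # implementation detail; do not rely on either outcome

# Identity is the right tool when you really mean "this exact object",
# for example when checking for the None singleton.
result = None
print(result is None)  # True
```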


If you are really interested in the details, you can find the implementation of CPython’s strings here. But again: this is an implementation detail, so you should never rely on it working.

Nosy answered 26/5, 2013 at 6:18 Comment(11)
Anyone reading this using python 2.x should also be aware that they should not use a is False, as booleans aren't singletons.Caracara
@Caracara not even python3 should do thatFluff
a is False makes no sense. The correct spelling is not a.Unprecedented
a is False is beautiful english. Apparently it is too good to be True :)Aloin
@LennartRegebro - you might want to compare two boolean expressions for equality though (in which case, == is correct).Lorelle
@sapi: how about assuming Python 2.5 or later? My belief is that assuming this minimum version they are "singletons" (not the right word, but it'll do); is that really not so? Can you give an example?Resurrection
@ChrisMorgan - type False = True into the shell; that's why comparing to True or False at all doesn't make any sense.Caracara
Ah yes, I suppose so---but that doesn't, of itself, mean is True or is False is a bad idea. If anyone does things like assigning to the names False or True they should expect things to break. If identity comparison with the names True and False is not sound, then assignation of the values True and False is similarly not sound.Resurrection
@ChrisMorgan: You should think more about it. a is False is indeed a bad idea (regardless of whether False and True are variables, constants, or keywords). From the linguistic/logical point of view, think about expressions like if not ambiguous is False then... (ambiguous is a bad identifier on its own). a is False expresses a simple truth in a rather complicated way. And this is not what should be done when programming.Escarpment
@pepr: sure, I know that is True and is False will almost never be the right way of doing it, but my objection is to @sapi's original way of putting it, that the reason it shouldn't be used is that "booleans aren't singletons".Resurrection
Yes. +1. Anyway, we should always go to the core of the problem.Escarpment
A
5

The is operator compares object identity, which the id function exposes as an integer that is guaranteed to be unique among simultaneously existing objects. In CPython specifically, id returns the object's memory address. It seems that CPython has consistent memory addresses for strings containing only characters a-z and A-Z.

However, this seems to only be the case when the string has been assigned to a variable:

Here, the id of "foo" and the id of a are the same. a has been set to "foo" prior to checking the id.

>>> a = "foo"
>>> id(a)
4322269384
>>> id("foo")
4322269384

However, the id of "bar" and the id of a are different when checking the id of "bar" prior to setting a equal to "bar".

>>> id("bar")
4322269224
>>> a = "bar"
>>> id(a)
4322268984

Checking the id of "bar" again after setting a equal to "bar" returns the same id.

>>> id("bar")
4322268984

So it seems that CPython keeps consistent memory addresses for strings containing only a-zA-Z when those strings are assigned to a variable. It's also entirely possible that this is version dependent: I'm running Python 2.7.3 on a MacBook. Others might get entirely different results.

Aftonag answered 26/5, 2013 at 6:59 Comment(1)
I will be surprised if it is machine dependent. you probably mean "version dependent".Aloin
W
1

In fact, your code amounts to comparing the objects' ids (in CPython, their memory addresses). So instead of your is comparison:

>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False

You can do:

>>> id(a) == id(b)
False

But note that if the two literals appear directly in the comparison, the result is True:

>>> id('is it the space?') == id('is it the space?')
True

In fact, within a single expression, identical static strings are shared. But at program scale, only word-like strings (with neither spaces nor punctuation) are shared.

You should not rely on this behavior, as it's not documented anywhere and is an implementation detail.
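
The equivalence between is and comparing ids can be sketched with objects that are never shared, such as lists. The function name same_object is made up for this example. The comparison is only meaningful while both objects are alive, because an id may be reused once an object has been freed.

```python
def same_object(x, y):
    """Compare identities through id(); both arguments stay alive during the call."""
    return id(x) == id(y)


p = [1, 2, 3]
q = [1, 2, 3]  # equal contents, but a separate list object
r = p          # a second name for the same list object

print(same_object(p, r), p is r)  # True True
print(same_object(p, q), p is q)  # False False
print(p == q)                     # True
```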

Wafd answered 26/5, 2013 at 6:29 Comment(0)
B
0

Two or more identical strings of consecutive alphanumeric (only) characters are stored in one structure, thus they share their memory reference. There have been posts about this phenomenon all over the internet since the 1990s. It has evidently always been that way. I have never seen a reasonable guess as to why that's the case. I only know that it is. Furthermore, if you split and re-join alphanumeric strings to remove spaces between words, the resulting identical alphanumeric strings do NOT share a reference, which I find odd. See below:

Add any non-alphanumeric value identically to both strings, and they instantly become copies, but not shared references.

a ="abbacca";  b = "abbacca";  a is b => True
a ="abbacca "; b = "abbacca "; a is b => False
a ="abbacca?"; b = "abbacca?"; a is b => False
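
If shared references are actually wanted for strings built at run time, such as the re-joined ones mentioned above, Python 3 offers sys.intern for requesting them explicitly. The snippet below is a small sketch with made-up variable names. It says nothing about whether a and b themselves are the same object, only about the values returned by sys.intern.

```python
import sys

# Split and re-join to drop the space; each join builds its string at run time.
words = "abba cca".split()
a = "".join(words)
b = "".join(words)

print(a == b)  # True: same characters

# sys.intern returns the canonical object for a given string value,
# so equal strings always map to the same interned object.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)  # True
```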

~Dr. C.

Backgammon answered 13/1, 2023 at 1:0 Comment(0)
P
-1

The 'is' operator compares the actual objects.

c is d should also be false. My guess is that Python makes some optimization and, in that case, it is the same object.

Precautious answered 26/5, 2013 at 6:18 Comment(5)
CPython keeps a pool of objects that are used frequently - short string literals and primitives like ints in the range 1-100. There is no reason to assume c is d should be false.Aloin
But on the other hand, one should not assume that c is d is true.Nosy
@Aloin my point exactly. The pool is an optimization and strings with spaces are simply out of its scope. @Nosy is right, one should not assume that c is dPrecautious
@Fluff I meant "at least", I did not know the exact range... thanks.Aloin
@Precautious obviously the OP is asking about these implementation details exactly. I assume he knows not to use them.Aloin
