In Python, why do separate dictionary string values pass "is" identity checks? (String Interning Experiment)

I am building a Python utility that will involve mapping integers to word strings, where many integers might map to the same string. From my understanding, Python interns short strings and most hard-coded strings by default, saving memory overhead as a result by keeping a "canonical" version of the string in a table. I thought that I could benefit from this by interning string values, even though string interning is built more for key hashing optimization. I wrote a quick test that checks string equality for long strings, first with just strings stored in a list, and then strings stored in a dictionary as values. The behavior is unexpected to me:

import sys

top = 10000

non1 = []
non2 = []
for i in range(top):
    s1 = '{:010d}'.format(i)
    s2 = '{:010d}'.format(i)
    non1.append(s1)
    non2.append(s2)

same = True
for i in range(top):
    same = same and (non1[i] is non2[i])
print("non: ", same) # prints False
del non1[:]
del non2[:]


with1 = []
with2 = []
for i in range(top):
    s1 = sys.intern('{:010d}'.format(i))
    s2 = sys.intern('{:010d}'.format(i))
    with1.append(s1)
    with2.append(s2)

same = True
for i in range(top):
    same = same and (with1[i] is with2[i])
print("with: ", same) # prints True

###############################

non_dict = {}
non_dict[1] = "this is a long string"
non_dict[2] = "this is another long string"
non_dict[3] = "this is a long string"
non_dict[4] = "this is another long string"

with_dict = {}
with_dict[1] = sys.intern("this is a long string")
with_dict[2] = sys.intern("this is another long string")
with_dict[3] = sys.intern("this is a long string")
with_dict[4] = sys.intern("this is another long string")

print("non: ",  non_dict[1] is non_dict[3] and non_dict[2] is non_dict[4]) # prints True ???
print("with: ", with_dict[1] is with_dict[3] and with_dict[2] is with_dict[4]) # prints True

I thought that the non-interned dictionary check would print "False", but I was clearly mistaken. Would anyone know what is happening, and whether string interning would yield any benefit at all in my case? I could have many, many more keys than distinct values if I consolidate data from several input texts, so I am looking for a way to save memory. (Maybe I will have to use a database, but that is outside the scope of this question.) Thank you in advance!

Alatea answered 1/1, 2017 at 2:15 Comment(1)
What 2357112 said. Note that constructed strings generally won't recycle an interned value, e.g. a = "a long string"; b = "a long" + " string"; print(id(a) == id(b)) prints False. - Antecedence

One of the optimizations performed by the bytecode compiler, similar to but distinct from interning, is that it will use the same object for equal constants in the same code block. The string literals here:

non_dict = {}
non_dict[1] = "this is a long string"
non_dict[2] = "this is another long string"
non_dict[3] = "this is a long string"
non_dict[4] = "this is another long string"

are in the same code block, so equal strings end up represented by the same string object.
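One way to see this, as a minimal sketch that assumes CPython (constant sharing is an implementation detail, not a language guarantee), is to compile a snippet and look at the constants table of the resulting code object. The names `source`, `code`, and `string_consts` are only illustrative.

```python
# Compile a tiny module that uses the same string literal twice.
source = '''
d = {}
d[1] = "this is a long string"
d[3] = "this is a long string"
'''
code = compile(source, "<example>", "exec")

# The code object stores each distinct constant once, so the literal
# appears a single time even though the source mentions it twice.
string_consts = [c for c in code.co_consts if isinstance(c, str)]
print(string_consts)

# Run the compiled code and compare the two dictionary values.
namespace = {}
exec(code, namespace)
d = namespace["d"]
print(d[1] is d[3])  # both values were loaded from the same constant slot
```

No call to sys.intern is involved here: both assignments simply load the same constant from the code object.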

Michellemichels answered 1/1, 2017 at 2:20 Comment(1)
Ah, that's right! I just tried this, and introducing run-time variability leads to the expected False print-out. Thanks for clarifying. u_in = input("enter a runtime string: "); non_dict = {}; non_dict[1] = "this is a long string" + u_in; non_dict[2] = "this is another long string" + u_in; non_dict[3] = "this is a long string" + u_in; non_dict[4] = "this is another long string" + u_in - Alatea
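For the original use case, where the values are built at run time rather than written as literals, a rough sketch like the following may help judge whether sys.intern is worth it. The helpers `build_mapping` and `runtime_words` are hypothetical names made up for this illustration, and the object counts reflect CPython behavior rather than anything the language guarantees.

```python
import sys

def build_mapping(pairs, intern_values):
    """Map integer keys to words, optionally interning each value."""
    mapping = {}
    for key, word in pairs:
        mapping[key] = sys.intern(word) if intern_values else word
    return mapping

def runtime_words(n, distinct):
    """Yield (key, word) pairs whose words are built at run time."""
    for i in range(n):
        yield i, "word-{:06d}".format(i % distinct)

plain = build_mapping(runtime_words(1000, 10), intern_values=False)
interned = build_mapping(runtime_words(1000, 10), intern_values=True)

# Count how many separate string objects each mapping keeps alive.
distinct_plain = len({id(v) for v in plain.values()})
distinct_interned = len({id(v) for v in interned.values()})

print("without intern:", distinct_plain)
print("with intern:   ", distinct_interned)  # one object per distinct word
```

With many keys sharing few distinct words, the interned mapping holds one string object per distinct word, which is where any memory saving would come from.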
