Dictionary access speed comparison with integer key against string key
C

5

32

I have a large dictionary in which I need to look up values many times. My keys are integers, but they represent labels, so they never need to be added, subtracted, etc. I tried comparing access times for a dictionary with string keys against one with integer keys, and here is the result.

from timeit import Timer

Dint = dict()
Dstr = dict()

for i in range(10000):
    Dint[i] = i
    Dstr[str(i)] = i


print 'string key in Dint',
print(Timer("'7498' in Dint", "from __main__ import Dint").timeit(100000000))
print 'int key in Dint',
print(Timer("7498 in Dint", "from __main__ import Dint").timeit(100000000))
print 'string key in Dstr',
print(Timer("'7498' in Dstr", "from __main__ import Dstr").timeit(100000000))
print 'int key in Dstr',
print(Timer("7498 in Dstr", "from __main__ import Dstr").timeit(100000000))

which produces the following, with only slight variations between runs:

string key in Dint 4.5552944017
int key in Dint 7.14334390267
string key in Dstr 6.69923791116
int key in Dstr 5.03503126455

Does it prove that using dictionary with strings as keys is faster to access than with integers as keys?

Czarevna answered 6/12, 2011 at 16:53 Comment(1)
It would be rather nicer if you used more than one key. - Mcintire
A
38

CPython's dict implementation is in fact optimized for string key lookups. There are two different functions, lookdict and lookdict_string (lookdict_unicode in Python 3), which can be used to perform lookups. Python will use the string-optimized version until a search for non-string data, after which the more general function is used. You can look at the actual implementation by downloading CPython's source and reading through dictobject.c.

As a result of this optimization, lookups are faster when a dict has all string keys.

Annul answered 6/12, 2011 at 17:0 Comment(0)
T
6

I'm afraid your times don't really prove very much.

Your test for a string in Dint is the fastest. In general, a test for anything that is not in a dictionary is quite likely to be fast, but here that is only because you were lucky and hit an empty cell on the first probe, so the lookup could terminate immediately. If you were unlucky and chose a value that hit one or more full cells, it could end up slower than the cases that actually find something.

Testing for an arbitrary string in a dictionary has to calculate the hash code for the string. That takes time proportional to the length of the string, but Python has a neat trick and only ever calculates it once for each string object. Since you use the same string over and over in your timing test, the time taken to calculate the hash is lost: it only happens the first time and not the other 99999999 times. If you were using a different string each time you would get a very different result.
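The hash caching described above can be observed directly. The following is only a minimal sketch, assuming a CPython interpreter (the caching is an implementation detail); the helper name time_hash_twice and the string length are made up for illustration.

```python
import time


def time_hash_twice(length=10_000_000):
    """Hash one freshly built string twice and time each call."""
    s = "x" * length  # new string object; its hash has not been computed yet

    t0 = time.perf_counter()
    first_hash = hash(s)   # has to walk the whole string
    t1 = time.perf_counter()
    second_hash = hash(s)  # CPython returns the value stored on the object
    t2 = time.perf_counter()

    return first_hash == second_hash, t1 - t0, t2 - t1


if __name__ == "__main__":
    same, first, second = time_hash_twice()
    print("hashes equal:", same)
    print("first call:  %.6f s" % first)
    print("second call: %.6f s" % second)
```

On a long string the second call should be far cheaper than the first, which is why a benchmark that reuses one string literal hides the hashing cost.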

Python has optimised code for dictionaries where the keys are strings. Overall you should find that using string keys, where you use the same keys multiple times, is slightly faster, but if you have to keep converting integers to strings before each lookup you'll lose that advantage.
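To see how much the conversion step costs, one could time three cases side by side: integer keys, string keys converted once up front, and string keys converted on every lookup. This is a rough sketch rather than a definitive benchmark; the names d_int, d_str, run and the key count are assumptions for illustration, and the numbers will differ between machines and Python versions.

```python
import timeit

N = 10_000
d_int = {i: i for i in range(N)}
d_str = {str(i): i for i in range(N)}
int_keys = list(range(N))
str_keys = [str(i) for i in int_keys]  # converted once, outside the timed code


def lookup_int():
    hit = False
    for k in int_keys:
        hit = k in d_int
    return hit


def lookup_str_preconverted():
    hit = False
    for k in str_keys:
        hit = k in d_str
    return hit


def lookup_str_converting():
    hit = False
    for k in int_keys:
        hit = str(k) in d_str  # pays for str() on every single lookup
    return hit


def run(number=200):
    """Time each variant; returns a dict of function name -> seconds."""
    funcs = (lookup_int, lookup_str_preconverted, lookup_str_converting)
    return {f.__name__: timeit.timeit(f, number=number) for f in funcs}


if __name__ == "__main__":
    for name, seconds in run().items():
        print("%-26s %.4f s" % (name, seconds))
```

Using many different keys, rather than a single literal, also addresses the comment under the question.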

Teratology answered 6/12, 2011 at 17:5 Comment(0)
H
5

This was my question too. Apparently, dictionaries with string keys are more efficient, but access times are really close. I ran the following code using Python 3:

import random
import timeit
import uuid

DICT_INT = dict()
DICT_STR = dict()
DICT_MIX = dict()

for i in range(2000000):
    DICT_INT[i] = uuid.uuid4().hex
    DICT_STR[str(i)] = uuid.uuid4().hex
    DICT_MIX[i if random.randrange(2) else str(i)] = uuid.uuid4().hex

def int_lookup():
    int_key = random.randrange(len(DICT_INT))
    str_key = str(int_key)
    mix_key = int_key if int_key % 2 else str_key
    return int_key in DICT_INT

def str_lookup():
    int_key = random.randrange(len(DICT_STR))
    str_key = str(int_key)
    mix_key = int_key if int_key % 2 else str_key
    return str_key in DICT_STR

def mix_lookup():
    int_key = random.randrange(len(DICT_MIX))
    str_key = str(int_key)
    mix_key = int_key if int_key % 2 else str_key
    return mix_key in DICT_MIX

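# The timed statement has to call the function ('int_lookup()'); the bare
# name 'int_lookup' would only time a name lookup.
print('Int dict lookup: ', end='')
print(timeit.timeit('int_lookup()', 'from __main__ import int_lookup', number=1000000))
print('Str dict lookup: ', end='')
print(timeit.timeit('str_lookup()', 'from __main__ import str_lookup', number=1000000))
print('Mix dict lookup: ', end='')
print(timeit.timeit('mix_lookup()', 'from __main__ import mix_lookup', number=1000000))

and this is the result (from a run with number=1000000000 and the statements written without the call parentheses, e.g. 'int_lookup', so the functions themselves were not called):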

Int dict lookup: 12.395361029000014
Str dict lookup: 12.097380312000041
Mix dict lookup: 12.109765773000163
Hispania answered 1/12, 2020 at 0:53 Comment(1)
This code measures things like random.randrange, string conversion, ternary operator and so the results are skewed. In general int lookup is faster than string lookup. - Eustis
E
0

As others said, Python provides specialized dictionaries, and int lookup is generally faster than string lookup.

The correct test should be something like this:

import random
import timeit
import uuid

DICT_INT = dict()
DICT_STR = dict()
DICT_MIX = dict()

KEYS_INT = []
KEYS_STR = []
KEYS_MIX = []

for i in range(2000000):
    key_int = i
    key_str = str(i)
    key_mix = i if random.randrange(2) else str(i)
    KEYS_INT.append(key_int)
    KEYS_STR.append(key_str)
    KEYS_MIX.append(key_mix)
    DICT_INT[key_int] = uuid.uuid4().hex
    DICT_STR[key_str] = uuid.uuid4().hex
    DICT_MIX[key_mix] = uuid.uuid4().hex

def int_lookup():
    for key in KEYS_INT:
        x = key in DICT_INT

def str_lookup():
    for key in KEYS_STR:
        x = key in DICT_STR

def mix_lookup():
    for key in KEYS_MIX:
        x = key in DICT_MIX

print('Int dict lookup:', timeit.timeit(int_lookup, number=100))
print('Str dict lookup:', timeit.timeit(str_lookup, number=100))
print('Mix dict lookup:', timeit.timeit(mix_lookup, number=100))

Otherwise you measure things like random.randrange, string conversion, the ternary operator, etc.

The result on my machine is:

Int dict lookup: 4.126786124999999
Str dict lookup: 22.824602666999997
Mix dict lookup: 19.024495125
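Single timings like these can shift from run to run. One common way to steady them is to repeat each measurement with timeit.repeat and keep the minimum. The sketch below is only an illustration of that idea: the helper names best_time, make_lookup and compare, and the smaller key count, are assumptions and not part of the test above.

```python
import random
import timeit


def best_time(func, repeat=5, number=10):
    """Fastest of `repeat` measurements, each running func `number` times."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


def make_lookup(keys):
    """Build a dict from keys and return a function that probes every key."""
    table = dict.fromkeys(keys, 0)

    def lookup():
        for key in keys:
            x = key in table

    return lookup


def compare(n=200_000, repeat=5, number=10):
    """Return the best time in seconds for int, str and mixed-key dicts."""
    int_keys = list(range(n))
    str_keys = [str(i) for i in int_keys]
    mix_keys = [i if random.randrange(2) else str(i) for i in int_keys]
    cases = {"int": int_keys, "str": str_keys, "mix": mix_keys}
    return {name: best_time(make_lookup(keys), repeat, number)
            for name, keys in cases.items()}


if __name__ == "__main__":
    for name, seconds in compare().items():
        print("%s dict lookup: %.4f" % (name, seconds))
```

The minimum is used because slower repetitions mostly reflect interference from other processes rather than the code being timed.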
Eustis answered 13/5, 2023 at 20:4 Comment(0)
B
0

I ran the code below 10 times. On average, it took 3x longer to build the dictionary when int_flag was False, and the lookups took 25% longer. _lookup_i is a list of 546 pseudo-random keys; it was hard-coded to ensure the same lookup keys were always used.

import datetime

import gen_util  # the author's own helper module (not shown)

_loop = 100000000
# _lookup_i (546 hard-coded integer keys) and _lookup_str (presumably the
# same keys as padded strings) are defined elsewhere and not shown.

def psuedo_main(int_flag):
    global _loop, _lookup_str, _lookup_i

    # Parentheses are needed: + binds tighter than the conditional expression.
    print('Test: ' + ('Integer' if int_flag else 'String'))

    test_d = dict()
    start_time = datetime.datetime.now().timestamp()
    for i in range(0, _loop):
        buf = gen_util.pad_string(str(i), 9, '0')
        key = i if int_flag else buf
        test_d[key] = buf[0:3] + '-' + buf[3:5] + '-' + buf[5:9]
    print('Build Time: ' + str(datetime.datetime.now().timestamp() - start_time))
    start_time = datetime.datetime.now().timestamp()
    key_l = _lookup_i if int_flag else _lookup_str
    for key in key_l:
        print(test_d[key])
    print('Lookup Time: ' + str(datetime.datetime.now().timestamp() - start_time))
Bargello answered 2/6 at 16:26 Comment(1)
Please correct your code formatting, and ensure the indentation is correct. Thanks - Wagonlit
