How to split dictionary into multiple dictionaries fast

I have found a solution, but it is really slow:

def chunks(self, data, SIZE=10000):
    # Slow: data.items() rebuilds the full list of items on every
    # iteration, and the slice then copies part of it again.
    for i in xrange(0, len(data), SIZE):
        yield dict(data.items()[i:i+SIZE])

Do you have any ideas that don't use external modules (numpy, etc.)?

Incurious answered 5/4, 2014 at 8:57 Comment(5)
Don't keep calling items. You're making a new list of all the items every time you just want a slice.Mawkish
Yeah, I know that, but the problem is that I can't find a different method to split my dictionary into equal-sized chunks.Incurious
Try the grouper recipe from itertools.Analyse
@badc0re: still, don't keep calling items. do it once.Wagshul
note: I don't see how splitting a dictionary can be useful... what the heck are you doing?Wagshul
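A minimal sketch of what the commenters are suggesting (hedged; Python 2, where items() returns a list, and keeping the original method signature): build the item list once, then slice it.

def chunks(self, data, SIZE=10000):
    items = data.items()  # built once; use list(data.items()) on Python 3
    for i in xrange(0, len(data), SIZE):
        yield dict(items[i:i+SIZE])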
98 votes

Since the dictionary is so big, it would be better to keep everything involved as iterators and generators, like this:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data)  # a single iterator over the keys, shared by every chunk
    for i in range(0, len(data), SIZE):
        # islice consumes the next SIZE keys from the shared iterator
        yield {k: data[k] for k in islice(it, SIZE)}

Sample run:

for item in chunks({i:i for i in xrange(10)}, 3):
    print(item)

Output

{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{8: 8, 6: 6, 7: 7}
{9: 9}
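
A quick illustration of the mechanism: islice drains the shared iterator, so each chunk picks up where the previous one stopped and chunks never overlap.

>>> from itertools import islice
>>> it = iter(range(10))
>>> list(islice(it, 3)), list(islice(it, 3))
([0, 1, 2], [3, 4, 5])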
Lorenz answered 5/4, 2014 at 9:7 Comment(3)
Great answer. Use range() instead of xrange() in Python 3.Chappy
Could you elaborate on how/why iterators/generators are preferable - is it for memory efficiency?Spector
If you use Python 3.12 or newer, you can use itertools.batched instead of itertools.islice. See my answer below for details.Sosna
7 votes

For Python 3+.

xrange() was renamed to range() in Python 3+.

You can use:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data)
    for i in range(0, len(data), SIZE):
        yield {k: data[k] for k in islice(it, SIZE)}

Sample:

for item in chunks({i: i for i in range(10)}, 3):
    print(item)

With the following output:

{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{6: 6, 7: 7, 8: 8}
{9: 9}
Heathenism answered 9/3, 2021 at 22:31 Comment(0)
5 votes

Another method is zipping iterators:

>>> from itertools import izip_longest, ifilter
>>> d = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6, 'g':7, 'h':8}

Create a list containing the same dict iterator several times (the number of references equals the number of elements you want in each result dict). Because izip_longest pulls from that one shared iterator for every position, each output tuple receives that many consecutive items from the source dict (ifilter removes the None padding from the final, shorter tuple). Using a generator expression keeps memory usage low:

>>> chunks = [d.iteritems()]*3
>>> g = (dict(ifilter(None, v)) for v in izip_longest(*chunks))
>>> list(g)
[{'a': 1, 'c': 3, 'b': 2},
 {'e': 5, 'd': 4, 'g': 7},
 {'h': 8, 'f': 6}]
Shaffer answered 5/4, 2014 at 9:48 Comment(1)
If taking this approach in Python 3, it's important to replace d.iteritems() with iter(d.items()), and not just d.items(). This is because @npdu's approach relies on the fact that you're exhausting the same iterator (so the view object returned by d.items() in Python 3 doesn't fulfill the same role). Other changes that you would make are replacing izip_longest with zip_longest and ifilter with the built-in filter.Chukar
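Following that comment, a hedged sketch of the Python 3 port (same idea, same shared-iterator trick; output order assumes Python 3.7+ insertion-ordered dicts):

>>> from itertools import zip_longest
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8}
>>> chunks = [iter(d.items())] * 3   # three references to one iterator
>>> g = (dict(filter(None, v)) for v in zip_longest(*chunks))
>>> list(g)
[{'a': 1, 'b': 2, 'c': 3}, {'d': 4, 'e': 5, 'f': 6}, {'g': 7, 'h': 8}]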
3 votes

Python 3.12 introduces the batched function in the itertools module. It groups the data from an iterable into tuples of the size passed as the second parameter (a standalone demo of batched appears after the code below). Using it, you can simplify the implementation from the top answer:

from itertools import batched

def chunks(data, SIZE=10000):
    for batch in batched(data.items(), SIZE):
        yield dict(batch)
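
As promised, a quick standalone demo of batched (requires Python >= 3.12): it groups any iterable into tuples of the requested size, with a shorter final tuple.

from itertools import batched

print(list(batched("ABCDEFG", 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]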

Or you can even make chunks a one-liner:

from itertools import batched

def chunks(data, SIZE=10000):
    return (dict(batch) for batch in batched(data.items(), SIZE))

When I tested the performance of iterating over this generator for a dict with a million entries and batches of size 10000, it took about 0.2 seconds on my hardware, roughly 33% less than the 0.3 seconds per run of the solution from the top answer.

Also, Kelly Bundy proposed an even simpler solution in the comments, with pretty much identical performance:

from itertools import batched

def chunks(data, SIZE=10000):
    return map(dict, batched(data.items(), SIZE))

The performance improvement comes from calling dict() on the items directly instead of building the dict in a comprehension, so if you can't use Python >= 3.12, you can go with:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data.items())
    for i in range(0, len(data), SIZE):
        yield dict(islice(it, SIZE))
Sosna answered 17/1 at 11:29 Comment(4)
Your yield from is purely wasteful, no point wrapping your generator iterator in another generator iterator which does nothing but slow it down. How about return map(dict, batched(data.items(), SIZE)) (not tested, don't have 3.12)?Rationalize
Makes sense, I could just return a generator expression. This yield from doesn't impact the performance in any noticeable way, though. The map-based solution you proposed also works like a charm, with pretty much identical performance to those from my answer.Sosna
Yeah, with SIZE=10000, almost all time is spent by the underlying operations (batching and dict-building). You might see a little difference with small SIZE values. But mostly I meant this out of principle. A generator that does nothing but yield from genexp is just strictly worse than return genexp and in other cases it can make a noticeable difference.Rationalize
Agreed, I replaced yield from with return in my answer.Sosna
1 vote

This code takes a large dictionary and splits it into a list of small dictionaries. The max_limit variable sets the maximum number of key-value pairs allowed in a sub-dictionary. The split costs just one complete pass over the dictionary object.

import copy

def split_dict_to_multiple(input_dict, max_limit=200):
    """Splits a dict into multiple dicts with a given maximum size.
    Returns a list of dictionaries."""
    chunks = []
    curr_dict = {}
    for k, v in input_dict.items():
        if len(curr_dict) < max_limit:
            curr_dict.update({k: v})
        else:
            # deepcopy also copies the values; appending curr_dict directly
            # would work too, since it is rebound on the next line
            chunks.append(copy.deepcopy(curr_dict))
            curr_dict = {k: v}
    # append the last (possibly smaller) chunk
    chunks.append(curr_dict)
    return chunks
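
A quick usage example (hedged; max_limit=2 just for illustration):

print(split_dict_to_multiple({i: i for i in range(5)}, max_limit=2))
# [{0: 0, 1: 1}, {2: 2, 3: 3}, {4: 4}]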
Mouthful answered 26/9, 2019 at 1:43 Comment(1)
better to provide some explanation with code snippetsAurthur
1 vote

This code works in Python 3.8 and does not use any external modules:

def split_dict(d, n):
    keys = list(d.keys())
    for i in range(0, len(keys), n):
        yield {k: d[k] for k in keys[i: i + n]}


for item in split_dict({i: i for i in range(10)}, 3):
    print(item)

prints this:

{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{6: 6, 7: 7, 8: 8}
{9: 9}

... and might even be slightly faster than the (currently) accepted answer of thefourtheye:

# hwcounter is an external package; `chunks` is the accepted answer's
# function and `split_dict` is the one defined above
from hwcounter import count, count_end


start = count()
for item in chunks({i: i for i in range(100000)}, 3):
    pass
elapsed = count_end() - start
print(f'elapsed cycles: {elapsed}')

start = count()
for item in split_dict({i: i for i in range(100000)}, 3):
    pass
elapsed = count_end() - start
print(f'elapsed cycles: {elapsed}')

prints

elapsed cycles: 145773597
elapsed cycles: 138041191
Collard answered 27/12, 2021 at 16:21 Comment(1)
Why don't you use Python's 'timeit' module to measure performance?Sidneysidoma
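For reference, a hedged equivalent benchmark using only the standard library's timeit (assuming chunks and split_dict are defined as above; absolute numbers will differ by machine):

import timeit

d = {i: i for i in range(100000)}
print(timeit.timeit(lambda: list(chunks(d, 3)), number=10))
print(timeit.timeit(lambda: list(split_dict(d, 3)), number=10))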
0 votes

Something like the following should work, with only builtins:

>>> adict = {1:'a', 2:'b', 3:'c', 4:'d'}
>>> chunklen = 2
>>> dictlist = list(adict.items())
>>> [ dict(dictlist[i:i + chunklen]) for i in range(0, len(dictlist), chunklen) ]
[{1: 'a', 2: 'b'}, {3: 'c', 4: 'd'}]

This preps the original dictionary into a list of items first, but you could possibly do that inside a one-liner as well (sketched below).
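A hedged sketch of that one-liner; note that it rebuilds the item list for every chunk, which is exactly what the question's comments warn against for large dicts:

>>> [dict(list(adict.items())[i:i + chunklen]) for i in range(0, len(adict), chunklen)]
[{1: 'a', 2: 'b'}, {3: 'c', 4: 'd'}]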

Steradian answered 18/12, 2022 at 15:20 Comment(0)
-1 votes

import numpy as np

chunk_size = 3
# d is an existing dict; note that np.array_split's second argument is the
# NUMBER of chunks to produce, not the size of each chunk
chunked_data = [[k, v] for k, v in d.items()]
chunked_data = np.array_split(chunked_data, chunk_size)

Afterwards you have a list of ndarrays, which you can iterate like this:

for chunk in chunked_data:
    for key, value in chunk:
        print(key)
        print(value)

These could be reassembled into a list of dicts with a simple loop or comprehension (sketched below).
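A hedged sketch of that reassembly, assuming chunked_data from above:

# note numpy coerced the string keys and int values to a common (string)
# dtype when building the array, so the values are no longer ints -- one
# more reason this approach is overkill for native dicts
list_of_dicts = [{key: value for key, value in chunk} for chunk in chunked_data]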

Rule answered 11/5, 2019 at 18:52 Comment(1)
Perhaps downvoted because it's an obvious overkill to use a numpy ndarray to chunk native dictionaries. The OP expressed the need to not use any external module, explicitly mentioning numpy.Scrivner
