How to split dictionary into multiple dictionaries fast

I have found a solution, but it is really slow:

def chunks(self, data, SIZE=10000):
    # Slow: data.items() rebuilds the full list of items on every
    # iteration, and the slice then copies part of it again.
    for i in xrange(0, len(data), SIZE):
        yield dict(data.items()[i:i+SIZE])

Do you have any ideas that don't use external modules (numpy, etc.)?

Incurious answered 5/4, 2014 at 8:57 Comment(5)
Don't keep calling items. You're making a new list of all the items every time you just want a slice.Mawkish
Yeah, I know that, but the problem is that I can't find a different method to split my dictionary into equal-sized chunks.Incurious
Try the grouper recipe from itertools.Analyse
@badc0re: still, don't keep calling items. do it once.Wagshul
note: I don't see how splitting a dictionary can be useful... what the heck are you doing?Wagshul
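A minimal sketch of what the commenters are suggesting (hedged; Python 2, where items() returns a list, and keeping the original method signature): build the item list once, then slice it.

def chunks(self, data, SIZE=10000):
    items = data.items()  # built once; use list(data.items()) on Python 3
    for i in xrange(0, len(data), SIZE):
        yield dict(items[i:i+SIZE])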
98 votes

Since the dictionary is so big, it would be better to keep everything involved as iterators and generators, like this:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data)  # a single iterator over the keys, shared by every chunk
    for i in range(0, len(data), SIZE):
        # islice consumes the next SIZE keys from the shared iterator
        yield {k: data[k] for k in islice(it, SIZE)}

Sample run:

for item in chunks({i:i for i in xrange(10)}, 3):
    print(item)

Output

{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{8: 8, 6: 6, 7: 7}
{9: 9}
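
A quick illustration of the mechanism: islice drains the shared iterator, so each chunk picks up where the previous one stopped and chunks never overlap.

>>> from itertools import islice
>>> it = iter(range(10))
>>> list(islice(it, 3)), list(islice(it, 3))
([0, 1, 2], [3, 4, 5])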
Lorenz answered 5/4, 2014 at 9:7 Comment(3)
Great answer. Use range() instead of xrange() in Python 3.Chappy
Could you elaborate on how/why iterators/generators are preferable - is it for memory efficiency?Spector
If you use Python 3.12 or newer, you can use itertools.batched instead of itertools.islice. See my answer below for details.Sosna
7 votes

For Python 3+.

xrange() was renamed to range() in Python 3+.

You can use:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data)
    for i in range(0, len(data), SIZE):
        yield {k: data[k] for k in islice(it, SIZE)}

Sample:

for item in chunks({i: i for i in range(10)}, 3):
    print(item)

With the following output:

{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{6: 6, 7: 7, 8: 8}
{9: 9}
Heathenism answered 9/3, 2021 at 22:31 Comment(0)
5 votes

Another method is zipping iterators:

>>> from itertools import izip_longest, ifilter
>>> d = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6, 'g':7, 'h':8}

Create a list containing the same dict iterator several times (the number of references equals the number of elements you want in each result dict). Because izip_longest pulls from that one shared iterator for every position, each output tuple receives that many consecutive items from the source dict (ifilter removes the None padding from the final, shorter tuple). Using a generator expression keeps memory usage low:

>>> chunks = [d.iteritems()]*3
>>> g = (dict(ifilter(None, v)) for v in izip_longest(*chunks))
>>> list(g)
[{'a': 1, 'c': 3, 'b': 2},
 {'e': 5, 'd': 4, 'g': 7},
 {'h': 8, 'f': 6}]
Shaffer answered 5/4, 2014 at 9:48 Comment(1)
If taking this approach in Python 3, it's important to replace d.iteritems() with iter(d.items()), and not just d.items(). This is because @npdu's approach relies on the fact that you're exhausting the same iterator (so the view object returned by d.items() in Python 3 doesn't fulfill the same role). Other changes that you would make are replacing izip_longest with zip_longest and ifilter with the built-in filter.Chukar
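Following that comment, a hedged sketch of the Python 3 port (same idea, same shared-iterator trick; output order assumes Python 3.7+ insertion-ordered dicts):

>>> from itertools import zip_longest
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8}
>>> chunks = [iter(d.items())] * 3   # three references to one iterator
>>> g = (dict(filter(None, v)) for v in zip_longest(*chunks))
>>> list(g)
[{'a': 1, 'b': 2, 'c': 3}, {'d': 4, 'e': 5, 'f': 6}, {'g': 7, 'h': 8}]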
3 votes

Python 3.12 introduces the batched function in the itertools module. It groups the data from an iterable into tuples of the size passed as the second parameter (a standalone demo of batched appears after the code below). Using it, you can simplify the implementation from the top answer:

from itertools import batched

def chunks(data, SIZE=10000):
    for batch in batched(data.items(), SIZE):
        yield dict(batch)
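
As promised, a quick standalone demo of batched (requires Python >= 3.12): it groups any iterable into tuples of the requested size, with a shorter final tuple.

from itertools import batched

print(list(batched("ABCDEFG", 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]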

Or you can even make chunks a one-liner:

from itertools import batched

def chunks(data, SIZE=10000):
    return (dict(batch) for batch in batched(data.items(), SIZE))

When I tested the performance of iterating over this generator for a dict with a million entries and batches of size 10000, it took about 0.2 seconds on my hardware, roughly 33% less than the 0.3 seconds per run of the solution from the top answer.

Also, Kelly Bundy proposed an even simpler solution in the comments, with pretty much identical performance:

from itertools import batched

def chunks(data, SIZE=10000):
    return map(dict, batched(data.items(), SIZE))

The performance improvement comes from calling dict() on the items directly instead of building the dict in a comprehension, so if you can't use Python >= 3.12, you can go with:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data.items())
    for i in range(0, len(data), SIZE):
        yield dict(islice(it, SIZE))
Sosna answered 17/1 at 11:29 Comment(4)
Your yield from is purely wasteful, no point wrapping your generator iterator in another generator iterator which does nothing but slow it down. How about return map(dict, batched(data.items(), SIZE)) (not tested, don't have 3.12)?Rationalize
Makes sense, I could just return a generator expression. This yield from doesn't impact the performance in any noticeable way, though. The map-based solution you proposed also works like a charm, with pretty much identical performance to those from my answer.Sosna
Yeah, with SIZE=10000, almost all time is spent by the underlying operations (batching and dict-building). You might see a little difference with small SIZE values. But mostly I meant this out of principle. A generator that does nothing but yield from genexp is just strictly worse than return genexp and in other cases it can make a noticeable difference.Rationalize
Agreed, I replaced yield from with return in my answer.Sosna
1 vote

This code takes a large dictionary and splits it into a list of small dictionaries. The max_limit variable sets the maximum number of key-value pairs allowed in a sub-dictionary. The split costs just one complete pass over the dictionary object.

import copy

def split_dict_to_multiple(input_dict, max_limit=200):
    """Splits a dict into multiple dicts with a given maximum size.
    Returns a list of dictionaries."""
    chunks = []
    curr_dict = {}
    for k, v in input_dict.items():
        if len(curr_dict) < max_limit:
            curr_dict.update({k: v})
        else:
            # deepcopy also copies the values; appending curr_dict directly
            # would work too, since it is rebound on the next line
            chunks.append(copy.deepcopy(curr_dict))
            curr_dict = {k: v}
    # append the last (possibly smaller) chunk
    chunks.append(curr_dict)
    return chunks
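
A quick usage example (hedged; max_limit=2 just for illustration):

print(split_dict_to_multiple({i: i for i in range(5)}, max_limit=2))
# [{0: 0, 1: 1}, {2: 2, 3: 3}, {4: 4}]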
Mouthful answered 26/9, 2019 at 1:43 Comment(1)
better to provide some explanation with code snippetsAurthur
1 vote

This code works in Python 3.8 and does not use any external modules:

def split_dict(d, n):
    keys = list(d.keys())
    for i in range(0, len(keys), n):
        yield {k: d[k] for k in keys[i: i + n]}


for item in split_dict({i: i for i in range(10)}, 3):
    print(item)

prints this:

{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{6: 6, 7: 7, 8: 8}
{9: 9}

... and might even be slightly faster than the (currently) accepted answer of thefourtheye:

# hwcounter is an external package; `chunks` is the accepted answer's
# function and `split_dict` is the one defined above
from hwcounter import count, count_end


start = count()
for item in chunks({i: i for i in range(100000)}, 3):
    pass
elapsed = count_end() - start
print(f'elapsed cycles: {elapsed}')

start = count()
for item in split_dict({i: i for i in range(100000)}, 3):
    pass
elapsed = count_end() - start
print(f'elapsed cycles: {elapsed}')

prints

elapsed cycles: 145773597
elapsed cycles: 138041191
Collard answered 27/12, 2021 at 16:21 Comment(1)
Why don't you use Python's 'timeit' module to measure performance?Sidneysidoma
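For reference, a hedged equivalent benchmark using only the standard library's timeit (assuming chunks and split_dict are defined as above; absolute numbers will differ by machine):

import timeit

d = {i: i for i in range(100000)}
print(timeit.timeit(lambda: list(chunks(d, 3)), number=10))
print(timeit.timeit(lambda: list(split_dict(d, 3)), number=10))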
0 votes

Something like the following should work, with only builtins:

>>> adict = {1:'a', 2:'b', 3:'c', 4:'d'}
>>> chunklen = 2
>>> dictlist = list(adict.items())
>>> [ dict(dictlist[i:i + chunklen]) for i in range(0, len(dictlist), chunklen) ]
[{1: 'a', 2: 'b'}, {3: 'c', 4: 'd'}]

This preps the original dictionary into a list of items first, but you could possibly do that inside a one-liner as well (sketched below).
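A hedged sketch of that one-liner; note that it rebuilds the item list for every chunk, which is exactly what the question's comments warn against for large dicts:

>>> [dict(list(adict.items())[i:i + chunklen]) for i in range(0, len(adict), chunklen)]
[{1: 'a', 2: 'b'}, {3: 'c', 4: 'd'}]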

Steradian answered 18/12, 2022 at 15:20 Comment(0)
-1 votes

import numpy as np

chunk_size = 3
# d is an existing dict; note that np.array_split's second argument is the
# NUMBER of chunks to produce, not the size of each chunk
chunked_data = [[k, v] for k, v in d.items()]
chunked_data = np.array_split(chunked_data, chunk_size)

Afterwards you have a list of ndarrays, which you can iterate like this:

for chunk in chunked_data:
    for key, value in chunk:
        print(key)
        print(value)

These could be reassembled into a list of dicts with a simple loop or comprehension (sketched below).
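A hedged sketch of that reassembly, assuming chunked_data from above:

# note numpy coerced the string keys and int values to a common (string)
# dtype when building the array, so the values are no longer ints -- one
# more reason this approach is overkill for native dicts
list_of_dicts = [{key: value for key, value in chunk} for chunk in chunked_data]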

Rule answered 11/5, 2019 at 18:52 Comment(1)
Perhaps downvoted because it's an obvious overkill to use a numpy ndarray to chunk native dictionaries. The OP expressed the need to not use any external module, explicitly mentioning numpy.Scrivner
