I have found a solution but it is really slow:

def chunks(self, data, SIZE=10000):
    for i in xrange(0, len(data), SIZE):
        yield dict(data.items()[i:i+SIZE])

Do you have any ideas without using external modules (numpy etc.)?
Since the dictionary is so big, it would be better to keep all the items involved as iterators and generators, like this:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data)
    for i in range(0, len(data), SIZE):
        yield {k: data[k] for k in islice(it, SIZE)}
Sample run:
for item in chunks({i: i for i in xrange(10)}, 3):
    print(item)
Output
{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{8: 8, 6: 6, 7: 7}
{9: 9}
range() instead of xrange() in Python 3 – Chappy
itertools.batched instead of itertools.islice. See my answer below for details. – Sosna
For Python 3+: xrange() was renamed to range().
You can use:

from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data)
    for i in range(0, len(data), SIZE):
        yield {k: data[k] for k in islice(it, SIZE)}

Sample:

for item in chunks({i: i for i in range(10)}, 3):
    print(item)
With the following output:
{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{6: 6, 7: 7, 8: 8}
{9: 9}
Another method is zipping iterators:
>>> from itertools import izip_longest, ifilter
>>> d = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6, 'g':7, 'h':8}
Create a list with copies of the dict iterator (the number of copies is the number of elements in the resulting dicts). By passing each iterator from the chunks list to izip_longest you will get the needed number of elements from the source dict (ifilter is used to remove None from the zip results). With a generator expression you can lower memory usage:
>>> chunks = [d.iteritems()]*3
>>> g = (dict(ifilter(None, v)) for v in izip_longest(*chunks))
>>> list(g)
[{'a': 1, 'c': 3, 'b': 2},
{'e': 5, 'd': 4, 'g': 7},
{'h': 8, 'f': 6}]
In Python 3, replace d.iteritems() with iter(d.items()), and not just d.items(). This is because @npdu's approach relies on the fact that you're exhausting the same iterator (so the view object returned by d.items() in Python 3 doesn't fulfill the same role). Other changes that you would make are replacing izip_longest with zip_longest and ifilter with the built-in filter. – Chukar
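Based on that comment, a minimal Python 3 sketch of the same zipping approach might look like this (my adaptation, not code from the original answer):

from itertools import zip_longest

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8}

# three references to the same items iterator, so zip_longest pulls 3 items per group
chunks = [iter(d.items())] * 3
# zip_longest pads the last group with None; filter(None, ...) drops the padding
g = (dict(filter(None, v)) for v in zip_longest(*chunks))
print(list(g))
# [{'a': 1, 'b': 2, 'c': 3}, {'d': 4, 'e': 5, 'f': 6}, {'g': 7, 'h': 8}]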
Python 3.12 introduces the batched function in the itertools module. It batches the data from an iterable into tuples of the size passed as the second parameter. Using it, you can simplify the implementation from the top answer:
from itertools import batched

def chunks(data, SIZE=10000):
    for batch in batched(data.items(), SIZE):
        yield dict(batch)
or even make it a one-liner:
from itertools import batched

def chunks(data, SIZE=10000):
    return (dict(batch) for batch in batched(data.items(), SIZE))
When I tested the performance of iterating over this generator for a dict with a million entries and batches of size 10000, it took about 0.2 seconds on my hardware, which is 33% less than the 0.3 seconds per run of the solution from the top answer.
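As a rough illustration only (the numbers above are the answerer's; this harness is my sketch and requires Python 3.12 for batched), such a comparison could be run with the standard time module:

import time
from itertools import batched, islice

data = {i: i for i in range(1_000_000)}

def chunks_batched(data, SIZE=10000):
    return (dict(batch) for batch in batched(data.items(), SIZE))

def chunks_islice(data, SIZE=10000):
    # top answer's approach
    it = iter(data)
    for i in range(0, len(data), SIZE):
        yield {k: data[k] for k in islice(it, SIZE)}

for fn in (chunks_batched, chunks_islice):
    start = time.perf_counter()
    for chunk in fn(data, 10000):
        pass
    print(fn.__name__, time.perf_counter() - start)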
Also, Kelly Bundy proposed an even simpler solution in the comments, with pretty much identical performance:
from itertools import batched

def chunks(data, SIZE=10000):
    return map(dict, batched(data.items(), SIZE))
The performance improvement comes from using dict() and iterating over items, instead of creating the dict in a comprehension, so if you can't use Python >= 3.12, you can go with:
from itertools import islice

def chunks(data, SIZE=10000):
    it = iter(data.items())
    for i in range(0, len(data), SIZE):
        yield dict(islice(it, SIZE))
yield from is purely wasteful, no point wrapping your generator iterator in another generator iterator which does nothing but slow it down. How about return map(dict, batched(data.items(), SIZE)) (not tested, don't have 3.12)? – Rationalize
yield from doesn't impact the performance in any noticeable way, though. The map-based solution proposed by you also works like a charm, with pretty much identical performance to those from my answer. – Sosna
yield from genexp is just strictly worse than return genexp, and in other cases it can make a noticeable difference. – Rationalize
Replaced yield from with return in my answer. – Sosna
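For context, the yield from form discussed in these comments (which, per the exchange above, the answer originally used) would look something like this sketch:

from itertools import batched

def chunks(data, SIZE=10000):
    # same result as the return-based version, but adds an extra layer
    # of generator delegation
    yield from (dict(batch) for batch in batched(data.items(), SIZE))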
This code takes a large dictionary and splits it into a list of small dictionaries. The max_limit variable sets the maximum number of key-value pairs allowed in a sub-dictionary. The code doesn't take much effort to split the dictionary, just one complete pass over the dictionary object.
import copy

def split_dict_to_multiple(input_dict, max_limit=200):
    """Splits dict into multiple dicts with given maximum size.
    Returns a list of dictionaries."""
    chunks = []
    curr_dict = {}
    for k, v in input_dict.items():
        if len(curr_dict.keys()) < max_limit:
            curr_dict.update({k: v})
        else:
            chunks.append(copy.deepcopy(curr_dict))
            curr_dict = {k: v}
    # update last curr_dict
    chunks.append(curr_dict)
    return chunks
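A quick usage sketch (the 10-entry dict and max_limit=3 are illustrative values of mine, not from the answer):

sub_dicts = split_dict_to_multiple({i: i for i in range(10)}, max_limit=3)
print(sub_dicts)
# [{0: 0, 1: 1, 2: 2}, {3: 3, 4: 4, 5: 5}, {6: 6, 7: 7, 8: 8}, {9: 9}]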
This code works in Python 3.8 and does not use any external modules:
def split_dict(d, n):
    keys = list(d.keys())
    for i in range(0, len(keys), n):
        yield {k: d[k] for k in keys[i: i + n]}

for item in split_dict({i: i for i in range(10)}, 3):
    print(item)
prints this:
{0: 0, 1: 1, 2: 2}
{3: 3, 4: 4, 5: 5}
{6: 6, 7: 7, 8: 8}
{9: 9}
... and might even be slightly faster than the (currently) accepted answer of thefourtheye:
from hwcounter import count, count_end

start = count()
for item in chunks({i: i for i in range(100000)}, 3):
    pass
elapsed = count_end() - start
print(f'elapsed cycles: {elapsed}')

start = count()
for item in split_dict({i: i for i in range(100000)}, 3):
    pass
elapsed = count_end() - start
print(f'elapsed cycles: {elapsed}')
prints
elapsed cycles: 145773597
elapsed cycles: 138041191
Something like the following should work, with only builtins:
>>> adict = {1:'a', 2:'b', 3:'c', 4:'d'}
>>> chunklen = 2
>>> dictlist = list(adict.items())
>>> [ dict(dictlist[i:i + chunklen]) for i in range(0, len(dictlist), chunklen) ]
[{1: 'a', 2: 'b'}, {3: 'c', 4: 'd'}]
This preps the original dictionary into a list of items, but you could possibly do that in a one-liner.
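One possible way to fold the items() prep into a single expression (my sketch, not part of the original answer; the nested for clause binds the item list once so it isn't rebuilt per chunk):

>>> adict = {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
>>> chunklen = 2
>>> [dict(items[i:i + chunklen]) for items in [list(adict.items())] for i in range(0, len(items), chunklen)]
[{1: 'a', 2: 'b'}, {3: 'c', 4: 'd'}]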
import numpy as np

chunk_size = 3
chunked_data = [[k, v] for k, v in d.items()]
# note: np.array_split(x, n) splits x into n roughly equal parts,
# so chunk_size here is the number of chunks, not their size
chunked_data = np.array_split(chunked_data, chunk_size)
Afterwards you have an ndarray, which is iterable like this:
for chunk in chunked_data:
    for key, value in chunk:
        print(key)
        print(value)
This could be re-assembled into a list of dicts using a simple for loop.
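A sketch of that re-assembly (the example dict is mine, not from the answer; note that numpy coerces the mixed keys and values to a common dtype, strings in this case):

import numpy as np

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # illustrative input
chunk_size = 3
chunked_data = np.array_split([[k, v] for k, v in d.items()], chunk_size)

# rebuild a list of plain dicts from the numpy chunks
dicts = []
for chunk in chunked_data:
    d_chunk = {}
    for key, value in chunk:
        d_chunk[str(key)] = str(value)  # str() gives plain Python strings back
    dicts.append(d_chunk)
print(dicts)
# [{'a': '1', 'b': '2'}, {'c': '3', 'd': '4'}, {'e': '5'}]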
items: you're making a new list of all the items every time you just want a slice. – Mawkish
grouper recipe from itertools. – Analyse
items: do it once. – Wagshul