To improve memory efficiency, I've been converting some of my code from lists to generators/iterators where I can. I've found many cases where I am just converting a list I've made to an `np.array` with the code pattern `np.array(some_list)`.
Notably, `some_list` is often a list comprehension that iterates over a generator.
I was looking into `np.fromiter` to see if I could use the generator more directly (rather than first having to cast it into a list and then convert that into a numpy array), but I noticed that the `np.fromiter` function, unlike any other array creation routine that uses existing data, requires specifying the `dtype`.
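For concreteness, here is a minimal sketch of the two patterns I mean. The `loglikelihoods` generator is a hypothetical stand-in for my real code, not something from numpy:

```python
import numpy as np

def loglikelihoods():
    # Hypothetical generator standing in for the real computation.
    for p in (0.1, 0.5, 0.9):
        yield float(np.log(p))

# Current pattern: build a list first, then let np.array infer the dtype.
via_list = np.array([ll for ll in loglikelihoods()])

# np.fromiter consumes the generator directly, but dtype must be given.
via_iter = np.fromiter(loglikelihoods(), dtype=np.float64)

# Leaving dtype out is an error rather than a request for inference.
try:
    np.fromiter(loglikelihoods())
    dtype_was_required = False
except TypeError:
    dtype_was_required = True
```

Both arrays hold the same values; the only difference visible from the outside is that the second call cannot be written without `dtype`.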
In most of my particular cases I can make that work (I'm mostly dealing with log-likelihoods, so float64 will be fine), but it left me wondering why this is necessary only for the `fromiter` array creator and not for the others.
First attempts at a guess:
Memory preallocation?
What I understand is that if you know the `dtype` and the `count`, it allows preallocating memory for the resulting `np.array`, and that if you don't specify the optional `count` argument, it will "resize the output array on demand". But if you do not specify the count, it would seem that you should be able to infer the `dtype` on the fly in the same way that you can in a normal `np.array` call.
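A small sketch of how I understand the `count` argument, assuming the number of items is known in advance (the squares generator is just an illustrative placeholder):

```python
import itertools
import numpy as np

n = 1000

# With count, the output can be allocated once at its final size.
preallocated = np.fromiter((i * i for i in range(n)), dtype=np.int64, count=n)

# Without count, the output is grown on demand while the iterator is consumed.
grown = np.fromiter((i * i for i in range(n)), dtype=np.int64)

# count also limits how many items are read, so an endless iterator is fine.
first_ten = np.fromiter(itertools.count(), dtype=np.int64, count=10)
```

In both of the first two calls `dtype` is still mandatory, which is the part I don't understand.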
Datatype recasting?
I could see this being useful for recasting data into new `dtype`s, but that would hold for other array creation routines as well, and would seem to merit placement as an optional but not required argument.
A couple ways of restating the question
So why is it that you need to specify the `dtype` to use `np.fromiter`; or, put another way, what are the gains that result from specifying the `dtype` if the array is going to be resized on demand anyway?
A more subtle version of the same question that is more directly related to my problem:
I know many of the efficiency gains of `np.ndarray`s are lost when you're constantly resizing them, so what is gained from using `np.fromiter(generator,dtype=d)` over `np.fromiter([gen_elem for gen_elem in generator],dtype=d)` over `np.array([gen_elem for gen_elem in generator],dtype=d)`?
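Spelled out as runnable code, the three variants look like this. It is only a sketch: `make_generator` is a hypothetical helper so that each call gets a fresh generator, and it shows that the results agree, not which one is cheaper:

```python
import numpy as np

def make_generator():
    # Hypothetical helper: returns a fresh generator on every call.
    return (x / 10 for x in range(5))

d = np.float64

# 1. Feed the generator straight into fromiter.
direct = np.fromiter(make_generator(), dtype=d)

# 2. Build a list first, then hand the list to fromiter.
list_then_fromiter = np.fromiter([g for g in make_generator()], dtype=d)

# 3. Build a list first, then hand the list to np.array.
list_then_array = np.array([g for g in make_generator()], dtype=d)
```

All three produce the same 1d float64 array, so the question is purely about what happens to memory and speed along the way.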
`dtype` is required other than that Tim Hochberg, who wrote it, wanted a 1d array with a specified `dtype`. Re: adding a shape parameter, they mention that they didn't want to make the code more complicated and that `array` has many complications… but that doesn't actually answer why `dtype` is still required for `fromiter` but not for any other array creation routine. Also, the thread is almost a decade old and numpy has changed a lot since then, so presumably changing it for API consistency was considered. – Batik