Why does numpy's fromiter function require specifying the dtype when other array creation routines don't?
In order to improve memory efficiency, I've been working on converting some of my code from lists to generators/iterators where I can. I've found a lot of instances of cases where I am just converting a list I've made to an np.array with the code pattern np.array(some_list).

Notably, some_list is often a list comprehension that is iterating over a generator.

I was looking into np.fromiter to see if I could use the generator more directly (rather than having to first cast it into a list and then convert that list into a NumPy array), but I noticed that np.fromiter, unlike every other array creation routine that uses existing data, requires specifying the dtype.

In most of my particular cases I can make that work (I'm mostly dealing with log-likelihoods, so float64 will be fine), but it left me wondering why this is necessary only for the fromiter array creator and not for the other array creators.

First attempts at a guess:

Memory preallocation?

As I understand it, if you know the dtype and the count, memory can be preallocated for the resulting np.array; if you don't specify the optional count argument, the function will "resize the output array on demand". But if you do not specify the count, it would seem that you should be able to infer the dtype on the fly, in the same way that a normal np.array call can.

Datatype recasting?

I could see this being useful for recasting data into new dtypes, but that would hold for other array creation routines as well, and would seem to merit making dtype an optional, not a required, argument.

A couple ways of restating the question

So why do you need to specify the dtype to use np.fromiter? Or, put another way, what is gained from specifying the dtype if the array is going to be resized on demand anyway?

A more subtle version of the same question, more directly related to my problem: I know many of the efficiency gains of np.ndarrays are lost when you're constantly resizing them, so what is gained from using np.fromiter(generator, dtype=d) over np.fromiter([gen_elem for gen_elem in generator], dtype=d) over np.array([gen_elem for gen_elem in generator], dtype=d)?
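To make the comparison concrete, here is a minimal sketch of the three variants (the generator and dtype are illustrative; each variant needs its own copy of the generator since generators are consumed once):

```python
import numpy as np

# Three ways to build the same float64 array from generated values.
gen1 = (x ** 2 for x in range(5))
gen2 = (x ** 2 for x in range(5))
gen3 = (x ** 2 for x in range(5))

a = np.fromiter(gen1, dtype=np.float64)                  # consume the generator directly
b = np.fromiter([elem for elem in gen2], dtype=np.float64)  # materialize a list first
c = np.array([elem for elem in gen3], dtype=np.float64)     # list comprehension + np.array

# all three are equal to array([0., 1., 4., 9., 16.])
```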

Batik answered 1/12, 2015 at 22:4 Comment(4)
This link references the reason: sourceforge.net/p/numpy/mailman/message/13497603 – Gravely
@toasteez This is great! But it actually doesn't seem to say anything about why dtype is required, other than that Tim Hochberg, who wrote it, wanted a 1d array with a specified dtype. Re: adding a shape parameter, they mention that they didn't want to make the code more complicated and that array has many complications… but that doesn't actually answer why dtype is still required for fromiter and not for any other array creation routine. Also, the thread is almost a decade old and numpy has changed a lot since then – so presumably changing it for API consistency was considered. – Batik
Probably best to raise an issue / question on the numpy GitHub. – Gravely
A warning… converting to numpy arrays will almost certainly not help your physical memory much, because numpy arrays need contiguous memory blocks, which are much harder to come by… – Minica

If this code was written a decade ago, and there hasn't been pressure to change it, then the old reasons still apply. Most people are happy using np.array. np.fromiter is mainly used by people who are trying to squeeze some speed out of iterative methods of generating values.

My impression is that np.array, the main alternative, reads/processes the whole input before deciding on the dtype (and other properties):

I can force a float return just by changing one element:

In [395]: np.array([0,1,2,3,4,5])
Out[395]: array([0, 1, 2, 3, 4, 5])
In [396]: np.array([0,1,2,3,4,5,6.])
Out[396]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.])

I don't use fromiter much, but my sense is that by requiring dtype, it can start converting the inputs to that type right from the start. That could produce a faster iteration, though that needs time tests.

I know that the np.array generality comes at a certain time cost. Often for small lists it is faster to use a list comprehension than to convert it to an array - even though array operations are fast.

Some time tests:

In [404]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=int)
100000 loops, best of 3: 3.35 µs per loop
In [405]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=float)
100000 loops, best of 3: 3.88 µs per loop
In [406]: timeit np.array([0,1,2,3,4,5,6.])
100000 loops, best of 3: 4.51 µs per loop
In [407]: timeit np.array([0,1,2,3,4,5,6])
100000 loops, best of 3: 3.93 µs per loop

The differences are small, but they suggest my reasoning is correct: requiring dtype helps keep fromiter fast. count makes no difference at this small size.
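For what it's worth, when the length is known in advance the count argument lets fromiter allocate the output once instead of growing it on demand (a small sketch; note that count must not exceed the number of items the iterator actually yields, or fromiter raises an error):

```python
import numpy as np

# With count, fromiter can preallocate exactly 1000 elements up front.
with_count = np.fromiter(range(1000), dtype=np.int64, count=1000)

# Without count, the buffer is grown on demand and shrunk to fit at the end.
without_count = np.fromiter(range(1000), dtype=np.int64)

# Both produce the same array; count only changes how the buffer is managed.
assert np.array_equal(with_count, without_count)
```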

Curiously, specifying a dtype for np.array slows it down. It's as though it appends an astype call:

In [416]: timeit np.array([0,1,2,3,4,5,6],dtype=float)
100000 loops, best of 3: 6.52 µs per loop
In [417]: timeit np.array([0,1,2,3,4,5,6]).astype(float)
100000 loops, best of 3: 6.21 µs per loop

The differences between np.array and np.fromiter are more dramatic when I use range(1000) (which in Python 3 is a lazy iterable rather than a list):

In [430]: timeit np.array(range(1000))
1000 loops, best of 3: 704 µs per loop

Actually, turning the range into a list is faster:

In [431]: timeit np.array(list(range(1000)))
1000 loops, best of 3: 196 µs per loop

but fromiter is still faster:

In [432]: timeit np.fromiter(range(1000),dtype=int)
10000 loops, best of 3: 87.6 µs per loop

It is faster to apply the int-to-float conversion to the whole array than to each element during the generation/iteration:

In [434]: timeit np.fromiter(range(1000),dtype=int).astype(float)
10000 loops, best of 3: 106 µs per loop
In [435]: timeit np.fromiter(range(1000),dtype=float)
1000 loops, best of 3: 189 µs per loop

Note that the astype conversion is not that expensive, only some 20 µs.
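In code, the two paths being timed above look like this (a sketch; the sizes are illustrative):

```python
import numpy as np

# Path 1: build as int, then one vectorized int-to-float conversion at the end.
via_astype = np.fromiter(range(1000), dtype=np.int64).astype(np.float64)

# Path 2: convert each yielded value to float as it is stored into the array.
direct_float = np.fromiter(range(1000), dtype=np.float64)

# Both yield the same float64 array; the timings above suggest path 1 is faster here.
assert np.array_equal(via_astype, direct_float)
```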

============================

array_fromiter(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds) is defined in:

https://github.com/numpy/numpy/blob/eeba2cbfa4c56447e36aad6d97e323ecfbdade56/numpy/core/src/multiarray/multiarraymodule.c

It processes the keywds and calls PyArray_FromIter(PyObject *obj, PyArray_Descr *dtype, npy_intp count) in https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/core/src/multiarray/ctors.c

This makes an initial array ret using the defined dtype:

ret = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type, dtype, 1,
                                            &elcount, NULL,NULL, 0, NULL);

The data attribute of this array is grown with 50% overallocation (0, 4, 8, 14, 23, 36, 56, 86, ...), and shrunk to fit at the end.
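That growth schedule can be reproduced with a small sketch (this formula is my reading of the C logic, not the literal code: each grow adds ~50% plus a small constant, with a minimum bump for tiny buffers):

```python
def fromiter_growth(n_grows):
    """Sketch of fromiter's ~50% overallocation schedule (illustrative)."""
    sizes = [0]
    cap = 0
    for _ in range(n_grows):
        # grow by half the current capacity, plus 4 while small, else 2
        cap = cap + (cap >> 1) + (4 if cap < 4 else 2)
        sizes.append(cap)
    return sizes

print(fromiter_growth(7))  # [0, 4, 8, 14, 23, 36, 56, 86]
```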

The dtype of this array, PyArray_DESCR(ret), apparently has a function that can take a value (provided by the iterator's next), convert it, and set it in the data:

`PyArray_DESCR(ret)->f->setitem(value, item, ret)`

In other words, all the dtype conversion is done by the defined dtype. The code would be a lot more complicated if it decided 'on the fly' how to convert the value (and all previously allocated ones). Most of the code in this function deals with allocating the data buffer.
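The whole loop can be summarized in a pure-Python sketch (illustrative only; the real work happens in C, and `ret[i] = value` stands in for the dtype's setitem, while np.resize stands in for the raw buffer reallocation):

```python
import numpy as np

def fromiter_sketch(iterable, dtype, count=-1):
    """Illustrative pure-Python rendering of PyArray_FromIter's logic."""
    dtype = np.dtype(dtype)
    cap = count if count >= 0 else 0
    ret = np.empty(cap, dtype=dtype)  # preallocate if count was given
    i = 0
    for value in iterable:
        if count >= 0 and i >= count:
            break
        if i >= ret.shape[0]:
            # grow the buffer with ~50% overallocation
            cap = i + (i >> 1) + (4 if i < 4 else 2)
            ret = np.resize(ret, cap)
        ret[i] = value  # the dtype's setitem converts and stores the value
        i += 1
    return ret[:i].copy()  # shrink to fit

out = fromiter_sketch(range(5), np.float64)
# out matches np.fromiter(range(5), dtype=np.float64)
```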

I'll hold off on looking up np.array. I'm sure it is much more complex.

Boozer answered 1/12, 2015 at 23:48 Comment(2)
Later I'll check some of these tests with the count specified (since the length is otherwise determined on the fly, by adding 50% to the allocated array length each time it runs into the end of the array). The only issue I have is that the old reasons for making dtype required were never actually given… your tests give a clue, but given that you can compute a partially ordered set on dtype representations (I think?), you would be able to infer it efficiently (working upward from uint8/int8 to object, using the C underpinnings of astype at reallocation time). But that would make np.fromiter really complex… – Batik
I found the fromiter code; it's quite simple. dtype conversions are handled by the predefined dtype object. fromiter just iterates and keeps the data buffer large enough. – Boozer

© 2022 - 2025 — McMap. All rights reserved.