Why does numpy's fromiter function require specifying the dtype when other array creation routines don't?
In order to improve memory efficiency, I've been working on converting some of my code from lists to generators/iterators where I can. I've found a lot of instances of cases where I am just converting a list I've made to an np.array with the code pattern np.array(some_list).

Notably, some_list is often a list comprehension that is iterating over a generator.

I was looking into np.fromiter to see if I could use the generator more directly (rather than having to first cast it into a list and then convert that list into a NumPy array), but I noticed that np.fromiter, unlike every other array creation routine that uses existing data, requires specifying the dtype.

In most of my particular cases I can make that work (I'm mostly dealing with log-likelihoods, so float64 will be fine), but it left me wondering why this is necessary only for the fromiter array creator and not for the other array creators.

First attempts at a guess:

Memory preallocation?

As I understand it, if you know the dtype and the count, memory can be preallocated for the resulting np.array; if you don't specify the optional count argument, the function will "resize the output array on demand". But if you do not specify the count, it would seem that you should be able to infer the dtype on the fly, in the same way that a normal np.array call can.

Datatype recasting?

I could see this being useful for recasting data into new dtypes, but that would hold for other array creation routines as well, and would seem to merit making dtype an optional, not a required, argument.

A couple ways of restating the question

So why do you need to specify the dtype to use np.fromiter? Or, put another way, what is gained from specifying the dtype if the array is going to be resized on demand anyway?

A more subtle version of the same question, more directly related to my problem: I know many of the efficiency gains of np.ndarrays are lost when you're constantly resizing them, so what is gained from using np.fromiter(generator, dtype=d) over np.fromiter([gen_elem for gen_elem in generator], dtype=d) over np.array([gen_elem for gen_elem in generator], dtype=d)?
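To make the comparison concrete, here is a minimal sketch of the three variants (the generator and dtype are illustrative; each variant needs its own copy of the generator since generators are consumed once):

```python
import numpy as np

# Three ways to build the same float64 array from generated values.
gen1 = (x ** 2 for x in range(5))
gen2 = (x ** 2 for x in range(5))
gen3 = (x ** 2 for x in range(5))

a = np.fromiter(gen1, dtype=np.float64)                  # consume the generator directly
b = np.fromiter([elem for elem in gen2], dtype=np.float64)  # materialize a list first
c = np.array([elem for elem in gen3], dtype=np.float64)     # list comprehension + np.array

# all three are equal to array([0., 1., 4., 9., 16.])
```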

Batik answered 1/12, 2015 at 22:4 Comment(4)
This link references the reason: sourceforge.net/p/numpy/mailman/message/13497603 – Gravely
@toasteez This is great! But it actually doesn't seem to say anything about why dtype is required, other than that Tim Hochberg, who wrote it, wanted a 1d array with a specified dtype. Re: adding a shape parameter, they mention that they didn't want to make the code more complicated and that array has many complications… but that doesn't actually answer why dtype is still required for fromiter and not for any other array creation routine. Also, the thread is almost a decade old and numpy has changed a lot since then – so presumably changing it for API consistency was considered. – Batik
Probably best to raise an issue / question on the numpy GitHub. – Gravely
A warning… converting to numpy arrays will almost certainly not help your physical memory much, because numpy arrays need contiguous memory blocks, which are much harder to come by… – Minica

If this code was written a decade ago, and there hasn't been pressure to change it, then the old reasons still apply. Most people are happy using np.array. np.fromiter is mainly used by people who are trying to squeeze some speed out of iterative methods of generating values.

My impression is that np.array, the main alternative, reads/processes the whole input before deciding on the dtype (and other properties):

I can force a float return just by changing one element:

In [395]: np.array([0,1,2,3,4,5])
Out[395]: array([0, 1, 2, 3, 4, 5])
In [396]: np.array([0,1,2,3,4,5,6.])
Out[396]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.])

I don't use fromiter much, but my sense is that by requiring dtype, it can start converting the inputs to that type right from the start. That could produce a faster iteration, though that needs time tests.

I know that the np.array generality comes at a certain time cost. Often for small lists it is faster to use a list comprehension than to convert it to an array - even though array operations are fast.

Some time tests:

In [404]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=int)
100000 loops, best of 3: 3.35 µs per loop
In [405]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=float)
100000 loops, best of 3: 3.88 µs per loop
In [406]: timeit np.array([0,1,2,3,4,5,6.])
100000 loops, best of 3: 4.51 µs per loop
In [407]: timeit np.array([0,1,2,3,4,5,6])
100000 loops, best of 3: 3.93 µs per loop

The differences are small, but they suggest my reasoning is correct: requiring dtype helps keep fromiter fast. count makes no difference at this small size.
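For what it's worth, when the length is known in advance the count argument lets fromiter allocate the output once instead of growing it on demand (a small sketch; note that count must not exceed the number of items the iterator actually yields, or fromiter raises an error):

```python
import numpy as np

# With count, fromiter can preallocate exactly 1000 elements up front.
with_count = np.fromiter(range(1000), dtype=np.int64, count=1000)

# Without count, the buffer is grown on demand and shrunk to fit at the end.
without_count = np.fromiter(range(1000), dtype=np.int64)

# Both produce the same array; count only changes how the buffer is managed.
assert np.array_equal(with_count, without_count)
```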

Curiously, specifying a dtype for np.array slows it down. It's as though it appends an astype call:

In [416]: timeit np.array([0,1,2,3,4,5,6],dtype=float)
100000 loops, best of 3: 6.52 µs per loop
In [417]: timeit np.array([0,1,2,3,4,5,6]).astype(float)
100000 loops, best of 3: 6.21 µs per loop

The differences between np.array and np.fromiter are more dramatic when I use range(1000) (which in Python 3 is a lazy iterable rather than a list):

In [430]: timeit np.array(range(1000))
1000 loops, best of 3: 704 µs per loop

Actually, turning the range into a list is faster:

In [431]: timeit np.array(list(range(1000)))
1000 loops, best of 3: 196 µs per loop

but fromiter is still faster:

In [432]: timeit np.fromiter(range(1000),dtype=int)
10000 loops, best of 3: 87.6 µs per loop

It is faster to apply the int-to-float conversion to the whole array than to each element during the generation/iteration:

In [434]: timeit np.fromiter(range(1000),dtype=int).astype(float)
10000 loops, best of 3: 106 µs per loop
In [435]: timeit np.fromiter(range(1000),dtype=float)
1000 loops, best of 3: 189 µs per loop

Note that the astype conversion is not that expensive, only some 20 µs.
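In code, the two paths being timed above look like this (a sketch; the sizes are illustrative):

```python
import numpy as np

# Path 1: build as int, then one vectorized int-to-float conversion at the end.
via_astype = np.fromiter(range(1000), dtype=np.int64).astype(np.float64)

# Path 2: convert each yielded value to float as it is stored into the array.
direct_float = np.fromiter(range(1000), dtype=np.float64)

# Both yield the same float64 array; the timings above suggest path 1 is faster here.
assert np.array_equal(via_astype, direct_float)
```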

============================

array_fromiter(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds) is defined in:

https://github.com/numpy/numpy/blob/eeba2cbfa4c56447e36aad6d97e323ecfbdade56/numpy/core/src/multiarray/multiarraymodule.c

It processes the keywds and calls PyArray_FromIter(PyObject *obj, PyArray_Descr *dtype, npy_intp count) in https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/core/src/multiarray/ctors.c

This makes an initial array ret using the defined dtype:

ret = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type, dtype, 1,
                                            &elcount, NULL,NULL, 0, NULL);

The data attribute of this array is grown with 50% overallocation (0, 4, 8, 14, 23, 36, 56, 86, ...), and shrunk to fit at the end.
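That growth schedule can be reproduced with a small sketch (this formula is my reading of the C logic, not the literal code: each grow adds ~50% plus a small constant, with a minimum bump for tiny buffers):

```python
def fromiter_growth(n_grows):
    """Sketch of fromiter's ~50% overallocation schedule (illustrative)."""
    sizes = [0]
    cap = 0
    for _ in range(n_grows):
        # grow by half the current capacity, plus 4 while small, else 2
        cap = cap + (cap >> 1) + (4 if cap < 4 else 2)
        sizes.append(cap)
    return sizes

print(fromiter_growth(7))  # [0, 4, 8, 14, 23, 36, 56, 86]
```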

The dtype of this array, PyArray_DESCR(ret), apparently has a function that can take a value (provided by the iterator's next), convert it, and set it in the data:

`PyArray_DESCR(ret)->f->setitem(value, item, ret)`

In other words, all the dtype conversion is done by the defined dtype. The code would be a lot more complicated if it decided 'on the fly' how to convert the value (and all previously allocated ones). Most of the code in this function deals with allocating the data buffer.
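The whole loop can be summarized in a pure-Python sketch (illustrative only; the real work happens in C, and `ret[i] = value` stands in for the dtype's setitem, while np.resize stands in for the raw buffer reallocation):

```python
import numpy as np

def fromiter_sketch(iterable, dtype, count=-1):
    """Illustrative pure-Python rendering of PyArray_FromIter's logic."""
    dtype = np.dtype(dtype)
    cap = count if count >= 0 else 0
    ret = np.empty(cap, dtype=dtype)  # preallocate if count was given
    i = 0
    for value in iterable:
        if count >= 0 and i >= count:
            break
        if i >= ret.shape[0]:
            # grow the buffer with ~50% overallocation
            cap = i + (i >> 1) + (4 if i < 4 else 2)
            ret = np.resize(ret, cap)
        ret[i] = value  # the dtype's setitem converts and stores the value
        i += 1
    return ret[:i].copy()  # shrink to fit

out = fromiter_sketch(range(5), np.float64)
# out matches np.fromiter(range(5), dtype=np.float64)
```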

I'll hold off on looking up np.array. I'm sure it is much more complex.

Boozer answered 1/12, 2015 at 23:48 Comment(2)
Later I'll check some of these tests with the count specified (since the length is otherwise determined on the fly, by adding 50% to the allocated array length each time it runs into the end of the array). The only issue I have is that the old reasons for making dtype required were never actually given… your tests give a clue, but given that you can compute a partially ordered set on dtype representations (I think?), you would be able to infer it efficiently (working upward from uint8/int8 to object, using the C underpinnings of astype at reallocation time). But that would make np.fromiter really complex… – Batik
I found the fromiter code; it's quite simple. dtype conversions are handled by the predefined dtype object. fromiter just iterates and keeps the data buffer large enough. – Boozer

© 2022 - 2025 — McMap. All rights reserved.