NumPy "record array" or "structured array" or "recarray"
What, if any, is the difference between a NumPy "structured array", a "record array" and a "recarray"?

The NumPy docs imply that the first two are the same: if they are, which is the preferred term for this object?

The same documentation says (at the bottom of the page): You can find some more information on recarrays and structured arrays (including the difference between the two) here. Is there a simple explanation of this difference?

Catechist answered 17/1, 2015 at 1:10 Comment(5)
Structured arrays (aka “Record arrays”) - Orit
I've clarified the question, @Ashwini Chaudhary - thanks. - Catechist
What's unclear about the explanation of the difference in the docs? recarray supports field access in arr.foo form, while normal structured arrays support access only in arr['foo'] form, which is faster to look up. I would never call "structured arrays" "record arrays", precisely because it causes so much potential confusion with "recarrays". - Professorship
For example, what is the history of the two sorts of record array? Are they completely different implementations or do they share underlying code? Why, given the attribute access overhead, would I want to use a recarray? - Catechist
Noob here... For me, recarrays allow an added level of flexibility when you wish to access data from arrays with many fields/columns. Access can be via my_array['DataField'] or by array-dot-field notation, my_array.DataField. I find this an added bonus and a step up from arrays where you have to rely on slicing by field position using numbers, since I can never remember which column they are in. - Superdreadnought

Records/recarrays are implemented in

https://github.com/numpy/numpy/blob/master/numpy/core/records.py

Some relevant quotes from this file

Record Arrays Record arrays expose the fields of structured arrays as properties. The recarray is almost identical to a standard array (which supports named fields already) The biggest difference is that it can use attribute-lookup to find the fields and it is constructed using a record.

recarray is a subclass of ndarray (in the same way that matrix and masked arrays are). But note that its constructor is different from np.array. It is more like np.empty(size, dtype).

class recarray(ndarray):
    """Construct an ndarray that allows field access using attributes.
    This constructor can be compared to ``empty``: it creates a new record
       array but does not fill it with data.
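The docstring's comparison with ``empty`` can be sketched as follows. This is only an illustration: the two-field dtype (x, y) and the values are made up, and the fields are filled by index assignment after construction because the constructor itself leaves the contents uninitialized.

```python
import numpy as np

# Hypothetical dtype for illustration: one float field and one int field.
dt = [("x", "f8"), ("y", "i4")]

# np.recarray takes a shape and a dtype, like np.empty(shape, dtype);
# the contents are uninitialized until we assign to them.
rec = np.recarray((3,), dtype=dt)
rec["x"] = [1.0, 2.0, 3.0]
rec["y"] = 0

# By contrast, np.array builds a structured array from existing data.
struct_arr = np.array([(1.0, 0), (2.0, 0), (3.0, 0)], dtype=dt)

print(type(rec).__name__, rec.shape, rec.dtype.names)
print(rec.x)  # attribute access to the "x" field
```

Because recarray subclasses ndarray, the result still passes isinstance(rec, np.ndarray) and supports the usual array methods.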

The key function implementing the field-as-attribute behavior is __getattribute__ (__getitem__ implements indexing):

def __getattribute__(self, attr):
    # See if ndarray has this attr, and return it if so. (note that this
    # means a field with the same name as an ndarray attr cannot be
    # accessed by attribute).
    try:
        return object.__getattribute__(self, attr)
    except AttributeError:  # attr must be a fieldname
        pass

    # look for a field with this name
    fielddict = ndarray.__getattribute__(self, 'dtype').fields
    try:
        res = fielddict[attr][:2]
    except (TypeError, KeyError):
        raise AttributeError("recarray has no attribute %s" % attr)
    obj = self.getfield(*res)

    # At this point obj will always be a recarray, since (see
    # PyArray_GetField) the type of obj is inherited. Next, if obj.dtype is
    # non-structured, convert it to an ndarray. If obj is structured leave
    # it as a recarray, but make sure to convert to the same dtype.type (eg
    # to preserve numpy.record type if present), since nested structured
    # fields do not inherit type.
    if obj.dtype.fields:
        return obj.view(dtype=(self.dtype.type, obj.dtype.fields))
    else:
        return obj.view(ndarray)

It first tries to get a regular attribute: things like .shape, .strides, .data, as well as all the methods (.sum, .reshape, etc.). Failing that, it looks up the name in the dtype field names. So it is really just a structured array with some redefined access methods.
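A small sketch of that lookup order, using a made-up two-field array (the field names x and y are assumptions for illustration): a real ndarray attribute such as .shape is found first, an unknown name falls back to the dtype fields, and a plain structured array has no such fallback.

```python
import numpy as np

# Hypothetical data for illustration.
dt = [("x", "f8"), ("y", "i4")]
struct_arr = np.array([(1.5, 1), (2.5, 2)], dtype=dt)
rec_arr = struct_arr.view(np.recarray)

shape = rec_arr.shape   # ordinary ndarray attribute, found first
x_attr = rec_arr.x      # not an ndarray attribute, so the field "x" is used
x_item = rec_arr["x"]   # index access works on both kinds of array

# A plain structured array only supports index access.
try:
    struct_arr.x
    plain_has_attr = True
except AttributeError:
    plain_has_attr = False

print(shape, x_attr, type(x_attr).__name__, plain_has_attr)
```

The simple (non-structured) field comes back as a plain ndarray, which matches the final obj.view(ndarray) branch in the quoted source.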

As best I can tell, "record array" and "recarray" are the same.

Another file shows something of the history

https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py

Collection of utilities to manipulate structured arrays. Most of these functions were initially implemented by John Hunter for matplotlib. They have been rewritten and extended for convenience.

Many of the functions in this file end with:

    if asrecarray:
        output = output.view(recarray)

The fact that you can return an array as recarray view shows how 'thin' this layer is.
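To illustrate how thin the layer is, here is a sketch (with made-up data) that views a structured array as a recarray and back again. No data is copied in either direction, so a write through the recarray view shows up in the original array.

```python
import numpy as np

# Hypothetical data for illustration.
struct_arr = np.array([(1.0, 1), (2.0, 2)], dtype=[("x", "f8"), ("y", "i8")])

# Viewing as a recarray changes the Python type, not the underlying buffer.
rec_view = struct_arr.view(np.recarray)
shares = np.shares_memory(struct_arr, rec_view)

# Writing through the recarray view modifies the original structured array.
rec_view.x[0] = 99.0

# Viewing back as ndarray drops the attribute-access behavior again.
back = rec_view.view(np.ndarray)

print(shares, struct_arr["x"], type(back).__name__)
```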

numpy has a long history, and merges several independent projects. My impression is that recarray is the older idea, and structured arrays are the current implementation, built on a generalized dtype. recarrays seem to be kept more for convenience and backward compatibility than for any new development. But I'd have to study the github file history, and any recent issues/pull requests, to be sure.

Reft answered 4/10, 2015 at 16:52 Comment(0)

The answer in a nutshell: you should generally use structured arrays rather than recarrays, because structured arrays are faster. The only advantage of recarrays is that they let you write arr.x instead of arr['x'], which can be a convenient shortcut but is also error-prone if your column names conflict with numpy methods/attributes.
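A sketch of that name-conflict pitfall, using deliberately awkward, made-up field names (size and mean) that collide with an ndarray attribute and an ndarray method:

```python
import numpy as np

# Hypothetical field names chosen to clash with ndarray's own attributes.
arr = np.array([(1, 10.0), (2, 20.0)], dtype=[("size", "i8"), ("mean", "f8")])
rec = arr.view(np.recarray)

size_attr = rec.size      # ndarray.size (number of elements), not the field
size_field = rec["size"]  # the field itself
mean_attr = rec.mean      # the bound ndarray.mean method, not the field
mean_field = rec["mean"]

print(size_attr, size_field, callable(mean_attr), mean_field)
```

Index access is unambiguous in both cases, which is one reason to prefer arr['x'] when field names are not under your control.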

See this excerpt from @jakevdp's book for a more detailed explanation. In particular, he notes that simply accessing columns of structured arrays can be around 20x to 30x faster than accessing columns of recarrays. However, his example uses a very small dataframe with just 4 rows and doesn't perform any standard operations.

For simple operations on larger arrays, the difference is likely to be much smaller, although structured arrays are still faster. For example, here are a structured array and a record array, each with 10,000 rows (the code to create the arrays from a dataframe is borrowed from @jpp's answer here).

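import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({'x': np.random.randn(n)})
df['y'] = df.x.astype(int)

# Record array, built directly by pandas
rec_array = df.to_records(index=False)

# Structured array: take the field names and dtypes from the DataFrame columns
s = df.dtypes
struct_array = np.array([tuple(x) for x in df.values], dtype=list(zip(s.index, s)))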

If we do a standard operation such as multiplying a column by 2 it's about 50% faster for the structured array:

%timeit struct_array['x'] * 2
9.18 µs ± 88.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit rec_array.x * 2
14.2 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
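Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. This is only an illustration: the arrays are built directly in NumPy rather than from a dataframe, the loop count is arbitrary, and the timings will vary by machine and NumPy version, so no expected numbers are shown.

```python
import timeit

import numpy as np

n = 10_000
loops = 1000  # arbitrary repeat count for this sketch
rng = np.random.default_rng(0)

# Build a structured array, then view the same data as a recarray.
struct_array = np.zeros(n, dtype=[("x", "f8"), ("y", "i8")])
struct_array["x"] = rng.standard_normal(n)
struct_array["y"] = struct_array["x"].astype("i8")
rec_array = struct_array.view(np.recarray)

# Time index access on the structured array vs attribute access on the recarray.
t_struct = timeit.timeit(lambda: struct_array["x"] * 2, number=loops)
t_rec = timeit.timeit(lambda: rec_array.x * 2, number=loops)

print(f"structured: {t_struct / loops * 1e6:.2f} us per loop")
print(f"recarray:   {t_rec / loops * 1e6:.2f} us per loop")
```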
Naoma answered 21/9, 2018 at 11:44 Comment(3)
Beauty. A solid statement on the question of a performance gap. To wit, it does exist, and you should probably be concerned. - Antofagasta
When I use dtype=list(zip(s.index, s)) I get an np array of dtype object for all columns. Is there a way to do the conversion from a pd.DataFrame such that the np.array retains the dtypes of the original columns (e.g. strings, ints and floats) rather than setting them all to object? - Deuterogamy
@Spcoggthesecond I'm not sure I follow... With the above code I get x as float and y as int for both struct_array and rec_array. If you aren't, I don't know why, but regular numpy arrays have an astype method for dtype conversions, and you could also see here for conversion of types for rec arrays: #9949927 - Naoma
