Why do we need endianness here?

I am reading source code that downloads a gzip-compressed file and reads the data into a NumPy array. The code is supposed to work on macOS and Linux, and here is the snippet I see:

import numpy

def _read32(bytestream):
    # Read 4 bytes and interpret them as one big-endian ('>') unsigned 32-bit integer.
    dt = numpy.dtype(numpy.uint32).newbyteorder('>')
    return numpy.frombuffer(bytestream.read(4), dtype=dt)

This function is used in the following context:

with gzip.open(filename) as bytestream:
    magic = _read32(bytestream)

It is not hard to see what happens here, but I am puzzled by the purpose of newbyteorder('>'). I have read the documentation and know what endianness means, but I cannot understand why exactly the developer added newbyteorder (in my opinion it is not really needed).

Iconoclasm answered 13/11, 2015 at 12:1 Comment(0)

That's because the downloaded data is in big-endian format, as described on the source page: http://yann.lecun.com/exdb/mnist/

All the integers in the files are stored in the MSB first (high endian) format used by most non-Intel processors. Users of Intel processors and other low-endian machines must flip the bytes of the header.
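
As a rough illustration of the quoted format (not code from the MNIST page or from the source being discussed), the sketch below builds a synthetic 16-byte header in memory and decodes its first four bytes both ways with the standard struct module. The header values other than the magic number 2051, which the MNIST page lists for image files, and all variable names are made up for the example.

```python
import gzip
import io
import struct

# Synthetic stand-in for the start of an MNIST image file (an assumption for
# illustration): magic number, item count, rows, columns, all big-endian.
header = struct.pack(">IIII", 2051, 2, 28, 28)
compressed = gzip.compress(header)

# gzip.open accepts a file object as well as a filename.
with gzip.open(io.BytesIO(compressed)) as bytestream:
    raw = bytestream.read(4)

magic_big = struct.unpack(">I", raw)[0]     # MSB first, as the file stores it
magic_little = struct.unpack("<I", raw)[0]  # same bytes read as little-endian

print(raw.hex(), magic_big, magic_little)
```

Read as big-endian, the four bytes 00 00 08 03 give 2051. Read as little-endian, the same bytes give 50855936, which is the kind of wrong value a little-endian machine would get without flipping the bytes.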

Cartier answered 13/11, 2015 at 12:21 Comment(2)
If you take a look at the code at line 45 you see `data = numpy.frombuffer(buf, dtype=numpy.uint8)`. This confuses things a little. Why is the endianness not specified in this line of code? - Tridactyl
Because the uint8 data type is just 1 byte long. Endianness is meaningful only for multi-byte data types. - Cartier
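
A small sketch of that point, assuming only NumPy itself (the variable names are illustrative): asking for a byte order on a one-byte dtype changes nothing, while the two byte orders of a four-byte dtype are distinct types.

```python
import numpy

single = numpy.dtype(numpy.uint8)
multi = numpy.dtype(numpy.uint32)

# Requesting big-endian order changes nothing for a one-byte type...
same_uint8 = single.newbyteorder('>') == single
# ...but the big- and little-endian 4-byte types are different dtypes.
same_uint32 = multi.newbyteorder('>') == multi.newbyteorder('<')

# Each byte maps to exactly one element, so there is nothing to reorder.
pixels = numpy.frombuffer(b"\x00\x01\x02", dtype=numpy.uint8).tolist()

print(single.byteorder, same_uint8, same_uint32, pixels)
```

NumPy reports the byte order of uint8 as '|', meaning "not applicable".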

It is just a way of ensuring that the bytes are interpreted from the resulting array in the correct order, regardless of a system's native byteorder.

By default, the built in NumPy integer dtypes will use the byteorder that is native to your system. For example, my system is little-endian, so simply using the dtype numpy.dtype(numpy.uint32) will mean that values read into an array from a buffer with the bytes in big-endian order will not be interpreted correctly.
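
To make this concrete, here is a minimal sketch that reads the same four big-endian bytes once with the native dtype and once with an explicitly big-endian dtype. The buffer contents are an assumed example, not data from the question.

```python
import sys
import numpy

buf = b"\x00\x00\x08\x03"  # 2051 stored most significant byte first

# Native dtype: the result depends on the machine running the code.
native = numpy.frombuffer(buf, dtype=numpy.uint32)[0]

# Explicit big-endian dtype: the result is the same on every machine.
big = numpy.frombuffer(buf, dtype=numpy.dtype(numpy.uint32).newbyteorder('>'))[0]

print(sys.byteorder, int(native), int(big))
```

On a little-endian system the native read yields 50855936, while the explicit big-endian read yields 2051 everywhere.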

If np.frombuffer is meant to receive bytes that are known to be in a particular byte order, the best practice is to modify the dtype using newbyteorder. This is mentioned in the documentation for np.frombuffer:

Notes

If the buffer has data that is not in machine byte-order, this should be specified as part of the data-type, e.g.:

>>> dt = np.dtype(int)
>>> dt = dt.newbyteorder('>')
>>> np.frombuffer(buf, dtype=dt)

The data of the resulting array will not be byteswapped, but will be interpreted correctly.
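
A short sketch of what that last sentence means in practice, using an assumed four-byte buffer: the array keeps the original bytes untouched, the dtype alone tells NumPy how to read them, and a native-order copy can be made explicitly if one is wanted.

```python
import sys
import numpy

buf = b"\x00\x00\x08\x03"
dt = numpy.dtype(numpy.uint32).newbyteorder('>')
arr = numpy.frombuffer(buf, dtype=dt)

value = int(arr[0])     # interpreted as big-endian: 2051
stored = arr.tobytes()  # underlying bytes, identical to buf

# astype makes a new array in the machine's own byte order, same values.
native_copy = arr.astype(numpy.uint32)

print(value, stored == buf, native_copy.tobytes() == buf)
```

The last comparison prints False on a little-endian machine, because only the converted copy has its bytes reordered.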

Adulthood answered 13/11, 2015 at 12:27 Comment(0)
