Difference between Big Endian and Little Endian byte order
What is the difference between Big Endian and Little Endian byte order?

Both of these seem to be related to Unicode and UTF-16. Where exactly do we use this?

Asbestos answered 31/3, 2009 at 15:37 Comment(6)
en.wikipedia.org/wiki/Endianness – Stannum
Don't forget about MIDDLE endian. It's on the wiki page. – Nuri
@Mitch: the same can be said for just about any question. – Linnette
@Jon B: Yes, it can, but some questions are better answered by sustained research rather than a couple of answers that some experts gave. – Breena
@BALAMURUGAN: Big Endian and Little Endian only come into play when there is multi-byte data. – Smilacaceous
Nicely explained: betterexplained.com/articles/… – Integumentary

Big-Endian (BE) / Little-Endian (LE) are two ways to organize multi-byte words. For example, when using two bytes to represent a character in UTF-16, there are two ways to represent the character 0x1234 as a string of bytes (0x00-0xFF):

Byte Index:      0  1
---------------------
Big-Endian:     12 34
Little-Endian:  34 12

In order to decide if a text uses UTF-16BE or UTF-16LE, the specification recommends prepending a Byte Order Mark (BOM) to the string, representing the character U+FEFF. So, if the first two bytes of a UTF-16 encoded text file are FE, FF, the encoding is UTF-16BE. For FF, FE, it is UTF-16LE.

A visual example: The word "Example" in different encodings (UTF-16 with BOM):

Byte Index:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------------------
ASCII:       45 78 61 6d 70 6c 65
UTF-16BE:    FE FF 00 45 00 78 00 61 00 6d 00 70 00 6c 00 65
UTF-16LE:    FF FE 45 00 78 00 61 00 6d 00 70 00 6c 00 65 00
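
If you want to reproduce the rows above yourself, here is a minimal Python sketch (assuming Python 3.8+ for bytes.hex with a separator). Note that Python's explicit utf-16-be/utf-16-le codecs do not add a BOM, so it is prepended manually here:

import codecs

text = "Example"

# ASCII: one byte per character, so no byte-order question at all
print(text.encode("ascii").hex(" "))                               # 45 78 61 6d 70 6c 65

# UTF-16 with an explicit byte order; the BOM is added by hand
print((codecs.BOM_UTF16_BE + text.encode("utf-16-be")).hex(" "))   # fe ff 00 45 00 78 ...
print((codecs.BOM_UTF16_LE + text.encode("utf-16-le")).hex(" "))   # ff fe 45 00 78 00 ...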

For further information, please read the Wikipedia pages on Endianness and/or UTF-16.

Beefburger answered 31/3, 2009 at 15:41 Comment(4)
Endianness is perpetually counter-intuitive in that BE stores the most significant byte at the smallest address, not the last/end address. Whatever. This site makes things clear (Big and Little Endian): "In big endian, you store the most significant byte in the smallest address." BTW, the visual example was helpful. – Darryldarryn
If you wish, you could change around the words to make more sense: [In big endian, you store the most significant byte in the smallest address.] OR [In big endian, you store the least significant byte in the largest address.] Same thing. – Halfwit
Link broken, @Darryldarryn; this one works: cs.umd.edu/~meesh/cmsc311/clin-cmsc311/Lectures/lecture6/… – Biopsy
I still don't see how this answer helps me decide whether I should choose BE or LE for my subtitle files. – Anderer

Ferdinand's answer (and others) is correct, but incomplete.

Big Endian (BE) / Little Endian (LE) have nothing to do with UTF-16 or UTF-32. They existed way before Unicode, and they affect how the bytes of numbers get stored in a computer's memory. They depend on the processor.

If you have a number with the value 0x12345678, then in memory it will be represented as 12 34 56 78 (BE) or 78 56 34 12 (LE).

UTF-16 and UTF-32 happen to use 2-byte and 4-byte units respectively, so the order of those bytes follows whatever ordering numbers follow on that platform.
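
As a rough illustration (a Python sketch using the standard struct module), packing the same 32-bit value with explicit byte orders shows both layouts; the native order depends on the machine running the code:

import struct
import sys

value = 0x12345678

print(struct.pack(">I", value).hex(" "))   # big-endian:    12 34 56 78
print(struct.pack("<I", value).hex(" "))   # little-endian: 78 56 34 12
print(struct.pack("=I", value).hex(" "),   # native order of this CPU
      f"(this machine is {sys.byteorder}-endian)")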

Spatiotemporal answered 24/7, 2009 at 8:30 Comment(0)

UTF-16 encodes Unicode into 16-bit values. Most modern filesystems operate on 8-bit bytes. So, to save a UTF-16 encoded file to disk, for example, you have to decide which part of the 16-bit value goes in the first byte, and which goes into the second byte.
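
For example, a small Python sketch (the file name demo.txt is made up) that writes a UTF-16 file and then peeks at the raw bytes to see which order was chosen; Python's built-in "utf-16" codec prepends a BOM and uses the machine's native byte order:

import codecs

# Hypothetical file name; the "utf-16" codec writes a BOM followed by the
# machine's native byte order, so the first two bytes reveal the choice.
with open("demo.txt", "w", encoding="utf-16") as f:
    f.write("hi")

with open("demo.txt", "rb") as f:
    raw = f.read()

print(raw.hex(" "))
print("BE" if raw.startswith(codecs.BOM_UTF16_BE) else "LE")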

Wikipedia has a more complete explanation.

Hogue answered 31/3, 2009 at 15:44 Comment(4)
This answer is incorrect. Endianness is related to the underlying hardware architecture. – Stannum
You can store a UTF-16 encoded file in either byte order regardless of the underlying hardware. – Hogue
Given the context of the question, this answer is perfectly acceptable IMHO. – Benniebenning
@joev: Exactly. It often is related to hardware architecture, but needn't necessarily be. For cross-platform compatibility, Unicode encoders/decoders should therefore be able to use either endianness. – Secant

little-endian: adj.

Describes a computer architecture in which, within a given 16- or 32-bit word, bytes at lower addresses have lower significance (the word is stored ‘little-end-first’). The PDP-11 and VAX families of computers and Intel microprocessors and a lot of communications and networking hardware are little-endian. The term is sometimes used to describe the ordering of units other than bytes; most often, bits within a byte.

big-endian: adj.

[common; From Swift's Gulliver's Travels via the famous paper On Holy Wars and a Plea for Peace by Danny Cohen, USC/ISI IEN 137, dated April 1, 1980]

Describes a computer architecture in which, within a given multi-byte numeric representation, the most significant byte has the lowest address (the word is stored ‘big-end-first’). Most processors, including the IBM 370 family, the PDP-10, the Motorola microprocessor families, and most of the various RISC designs are big-endian. Big-endian byte order is also sometimes called network order.

– from the Jargon File: http://catb.org/~esr/jargon/html/index.html

Merkel answered 4/5, 2010 at 15:37 Comment(0)

Byte endianness (big or little) needs to be specified for Unicode/UTF-16 encoding because, for character codes that use more than a single byte, there is a choice of whether to read/write the most significant byte first or last. UTF-16 requires this to be specified because its code units are two bytes wide (each character is represented by one or several such units). (Note, however, that UTF-8 "words" are always 8 bits/one byte in length [though characters can span several of them], therefore there is no problem with endianness.)

If the encoder and the decoder of a stream of bytes representing Unicode text don't agree on which convention is being used, the bytes can be interpreted as the wrong character codes. For this reason, either the endianness convention is known beforehand, or, more commonly, a byte order mark is specified at the beginning of any Unicode text file/stream to indicate whether big- or little-endian order is being used.
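
To illustrate the mismatch described above (a Python sketch), decoding the same two bytes with the wrong byte order yields an entirely different character, while UTF-8 produces the same byte sequence on every platform:

raw = "\u1234".encode("utf-16-be")        # b'\x12\x34'

print(raw.decode("utf-16-be"))            # U+1234, the intended character
print(raw.decode("utf-16-le"))            # U+3412, a different character entirely

print("\u1234".encode("utf-8").hex(" "))  # e1 88 b4, identical regardless of endianness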

Secant answered 31/3, 2009 at 15:45 Comment(10)
This answer is incorrect. Endianness is related to the underlying hardware architecture. – Stannum
UTF-8 is a variable-length encoding, using 1-6 bytes per character, and is thus not fixed to a single byte as stated here! – Beefburger
Right, so I haven't stated that endianness depends on hardware architecture, but I don't see how my answer is explicitly incorrect. Consider that text files written/read on different architectures must have their endianness known. – Secant
@Ferdinand: You are correct - I should mention that some variants of UTF-8 do not require it... – Secant
Sorry, you still haven't got it right. There are no variants of UTF-8 that don't require multiple bytes. If you only use ASCII characters, UTF-8 will represent them using single bytes. All characters with character code >127 will be encoded using multiple bytes! – Beefburger
Just for completeness - UTF-8 requires between 1 and 4 bytes. Valid UTF-8 cannot contain more than 4 bytes. – Launcher
@Noldorin: As I said, ASCII characters use a single byte. This is a property of UTF-8, not a variant! Using single bytes, you cannot encode non-ASCII Unicode values. – Beefburger
@Ferdinand: Yes, I've realised that since your original correction. The post has been clarified again, as I see your point... though I think I somewhat confused myself in the process of correcting myself. :P – Secant
(contd.) I think I'm right in saying that because "words" in UTF-8 are 8 bits/one byte long (invariably, despite the variable length of char codes), there is no problem with endianness, at least. – Secant
unicode.org/faq/utf_bom.html seems to agree, though again correct me if I'm wrong... – Secant

A neat way to remember which is which is to look at the words BIG ENDian and LITTLE ENDian.

Big Endian stores the BIGGEST END at the beginning. Just like it's spelt: Big ENDian.

Little Endian stores the LITTLEST END at the beginning. Just like it's spelt: Little ENDian.

By big and little I mean the significance. Big is the most significant. Little is the least significant.

Example: the value 258 (0x0102 in hex) is stored as 01 02 in big endian and 02 01 in little endian. Add 1 to it and it becomes 259 (big: 01 03, little: 03 01).

The least significant digit is the quickest to change; the most significant takes a lot to change. Like 1,000,000: all the zeros would change before the one does. The 1 in the millions place is the most significant digit in that example; the zeros are less significant because it takes less to change them.
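
A tiny Python sketch of the 258/259 example above:

for value in (258, 259):
    print(value,
          "big:", value.to_bytes(2, "big").hex(" "),        # 01 02, then 01 03
          "little:", value.to_bytes(2, "little").hex(" "))  # 02 01, then 03 01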

Analogy time:

The American way to write dates is like big endian (02/22 = Feb 22, where we put the more significant number [the month] first),

and the other way to write dates is like little endian (22/02 = 22 Feb, where they put the least significant number [the day] first, like writing a million as 000,000,1).

OPINION: The best way to write dates would be YYYY/MM/DD-HH:MM:SS (using a 24-hour clock). There's no confusion, and it's perfect for sorting by age because the year is the most significant number here, then month, then day, then hour, then minute, and finally second. This would be BIG ENDIAN.

Housebound answered 26/4 at 20:35 Comment(0)
