System where 1 byte != 8 bit? [duplicate]
S

9

98

All the time I read sentences like

don't rely on 1 byte being 8 bit in size

use CHAR_BIT instead of 8 as a constant to convert between bits and bytes

et cetera. What real life systems are there today, where this holds true? (I'm not sure if there are differences between C and C++ regarding this, or if it's actually language agnostic. Please retag if necessary.)
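
For context, this is roughly what the second piece of advice looks like in code -- a minimal sketch; the helper name bytes_for_bits is my own:

    #include <climits>   // CHAR_BIT
    #include <cstddef>   // std::size_t

    // Number of bytes needed to hold a given number of bits,
    // rounding up, without hard-coding 8.
    constexpr std::size_t bytes_for_bits(std::size_t bits) {
        return (bits + CHAR_BIT - 1) / CHAR_BIT;
    }

    static_assert(bytes_for_bits(1) == 1, "one bit fits in one byte");
    static_assert(bytes_for_bits(CHAR_BIT + 1) == 2, "one extra bit needs a second byte");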

Soong answered 1/4, 2011 at 16:16 Comment(23)
If you go around assuming all the world is on Intel, you'll be right 90% of the time. For now. But don't you want your code to work everywhere, and continue to work everywhere?Coss
The only current CPUs I'm aware of where CHAR_BIT may be other than 8 are DSPs which in many cases do not have byte addressable memory, so CHAR_BIT tends to be equal to the word size (e.g. 24 bits). Historically there were mainframes with 9 bit bytes (and 36 bit words) but I can't imagine there are too many of these still in use.Virulent
While you find that in the C or C++ standards, it actually has more to do with the architecture of the underlying chips than with the programming language. As long as you're working on the [desk|lap|palm]top you're unlikely to run into an exception soon, but delve into the embedded world and get ready for a ride.Downbow
+1. Good question. Even I want to know that. I think it should be tagged with C++ and C (so I added these tags). After all, it mentions CHAR_BIT.Equanimity
IIRC, a byte was originally defined as the space needed for one character. Now, in C and C++, a char is by definition a byte - but there are also multi-byte character representations. Personally, my view is that the ancient history is precisely that - "byte" has meant "8 bits" for decades, and anyone who specifies that their platform has non-8-bit "bytes" (if any such platform exists these days) should really be using a different word.Giorgia
@Steve314: The original byte was 6 bits. I can't imagine anybody wanting to use that concept of a byte. A better (but still not quite right) definition of a "byte" is that it is the smallest addressable unit of data. There are plenty of valid reasons some computer may have a byte size other than 8 bits. As dmckee mentioned, just delve into the embedded world. And do buckle up. The ride is a bit wild.Jarrell
@Steve314: The fact that char is by definition a "byte" doesn't really make much difference. These are just two internal terms in C/C++ terminology, which can be used interchangeably. The "byte" from C/C++ terminology has no formal relation to the machine "byte".Haney
@Paul Tomblin: Your reasoning is backwards. The allowance that CHAR_BIT be something other than 8 is not planning for the future; it's dragging along ancient history. POSIX finally shut the door on that chapter by mandating CHAR_BIT==8, and I don't think we'll ever see a new general-purpose (non-DSP) machine with non-8-bit bytes. Even on DSPs, making char 16- or 32-bit is just laziness by the compiler implementors. They could still make char 8-bit if they wanted, and hopefully C2020 or so will force them to...Deepfreeze
@Steve314 "a byte was originally defined as the space needed for one character." A byte was and still is defined as the smallest addressable unit. ""byte" has meant "8 bits" for decades" No, a byte has meant smallest addressable unit for decades. "Octet" has meant "8 bits" for decades.Genevivegenevra
@R.. "making char 16- or 32-bit is just laziness by the compiler implementors" Nonsense. Making char the size of the smallest natively addressable unit is sticking to the hardware, which is exactly what people expect. C and C++ are not Java.Genevivegenevra
@AndreyT "The "byte" from C/C++ terminology has no formal relation to the machine "byte"." Nonsense. A C/C++ byte is expected to exactly match the machine's byte, unless the machine's byte has less that 8 bits.Genevivegenevra
@Genevivegenevra - the term "byte" is usually credited to Werner Buchholz at IBM in 1956, who described it as meaning "a group of bits used to encode a character, or the number of bits transmitted in parallel to and from input-output units". It originally had nothing to do with addressable units - that meaning was adopted in various contexts (e.g. the C standard) despite "word" being defined as the "smallest addressable unit". Language evolves, and virtually everyone using the word "byte" in recent decades meant 8 bits. If you don't believe me, check a few dictionaries.Giorgia
@curiousguy: These days computers actually talk to one another. Having a byte that's anything other than an octet does nothing but severely break this important property. Same goes for using other backwards things like EBCDIC.Deepfreeze
@R.. "These days computers actually talk to one another. Having a byte that's anything other than an octet does nothing but severely break this important property" So you are not able to explain which problems it might cause.Genevivegenevra
@curiousguy: Information interchange is all octet-based. If a byte is larger than 8 bits, then the implementation's fputc (and other output functions) must write more than a single octet to the communications channel (file/socket/etc.). This means fputc('A', f) will write at least two octets, leaving extra junk (likely one or more null octets) for the recipient to read. Conversely, fgetc would read multiple octets from the communication channel as a single char for the host, requiring further ugly processing to break it apart to use it. This is a brief summary in limited comment space.Deepfreeze
@R.. "must write more than a single octet to the communications channel" not if they are expected to be useful.Genevivegenevra
To be conformant, it must. Round trip fgetc and fputc must be value-preserving. Not to mention, if it didn't, saving/loading binary data would not work.Deepfreeze
@curiousguy: Absolutely incorrect. For example, hardware platforms that have a 32-bit minimal addressable unit don't refer to these units as "bytes". C/C++ implementations on such platforms will normally use 32-bit chars, which will represent "bytes" in the C/C++ sense of the term.Haney
@AndreyT "For example, hardware platforms that have 32-bit minimal addressable unit don't refer to these units as "bytes"." so they call it...?Genevivegenevra
@curiousguy: Words. They call it words. Four-byte words, to be precise. The term "minimal addressable unit" (MAU) is also used from time to time by those who don't want to feel like they are tying the notion of "word" to the addressing properties of the hardware platform.Haney
@AndreyT: the C++ spec, §1.8/1, says "Every byte has a unique address." @curiousguy: since nothing in the spec says that address has to be a native address, I see nothing prohibiting a compiler from emulating 8-bit bytes on systems with a 16+ bit MAU. @R..: C++ doesn't specify that, and I/O can work with any number of bits; it just requires translation, similar to endian issues. I've done it.Audient
@Mooing Bit-width translation can be done ad-hoc for specific systems, but I don't think the C/C++ standards specify enough to guarantee portable translation in general. When CHAR_BIT values differ, suddenly we also have to worry about bit-ordering. The order of bytes in an I/O stream is well-defined, but the order of bits in an I/O stream is not specified AFAIK. If one machine outputs 60 bits to an I/O stream and the other reads 8, which 8 are they? What happens to the 4 left over (60-7*8)? I just give up and require CHAR_BIT = 8 for cross-machine I/O, which works most of the time.Gabriellia
I found an IBM reference, COMPUTER USAGE COMMUNIQUÉ Vol. 2 No. 3 from 1963, that uses the term "byte" to refer to a variable-size field of 1 to 8 bits. Until now I had always thought that 1 byte = 8 bits, although character sizes and word sizes could be different.Bruell
C
80

On older machines, codes smaller than 8 bits were fairly common, but most of those have been dead and gone for years now.

C and C++ have mandated a minimum of 8 bits for char, at least as far back as the C89 standard. [Edit: For example, C90, §5.2.4.2.1 requires CHAR_BIT >= 8 and UCHAR_MAX >= 255. C89 uses a different section number (I believe that would be §2.2.4.2.1) but identical content]. They treat "char" and "byte" as essentially synonymous [Edit: for example, CHAR_BIT is described as: "number of bits for the smallest object that is not a bitfield (byte)".]

There are, however, current machines (mostly DSPs) where the smallest type is larger than 8 bits -- a minimum of 12, 14, or even 16 bits is fairly common. Windows CE does roughly the same: its smallest type (at least with Microsoft's compiler) is 16 bits. They do not, however, treat a char as 16 bits -- instead they take the (non-conforming) approach of simply not supporting a type named char at all.
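
If a piece of code genuinely cannot cope with anything else, one way (a sketch, not the only one) to make the 8-bit assumption explicit is to reject such targets at compile time:

    #include <climits>

    // Refuse to build on targets (e.g. the DSPs mentioned above) where
    // a byte is wider than 8 bits.
    #if CHAR_BIT != 8
    #error "This code assumes 8-bit bytes (CHAR_BIT == 8)"
    #endif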

Charr answered 1/4, 2011 at 16:27 Comment(8)
I'll accept this answer because it puts everything important into one place. Maybe also add that bit from larsmans comment that CHAR_BIT is also self-documenting, which also made me use it now. I like self-documenting code. :) Thanks everyone for their answers.Soong
Could you please quote where in the C89 Standard it says char must be minimum of 8 bits?Equanimity
@Nawaz: I don't have C89 handy, but C99 section 5.2.4.2.1 says regarding the values in <limits.h> that "implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign." -- and then says that CHAR_BIT is 8. In other words, larger values are compliant, smaller ones are not.Jarrell
@David: Hmm, I had read that, but didn't interpret it that way. Thanks for making me understand that.Equanimity
Wow +1 for teaching me something new about how broken WinCE is...Deepfreeze
there is something seriously wrong with Windows CEMozellemozes
@Jerry, you sure about char and WinCE? I wrote a bit for WinCE 5.0 /x86 and /ARM; there was nothing wrong with char type. What they did is remove char-sized versions of Win32 API (so GetWindowTextW is there but GetWindowTextA is not etc.)Menial
@atzz: Availability (or lack of it) of char obviously depends on the compiler, not the OS itself. I (at least think I) remember one of the early compilers for CE lacking char, but it's been quite a while since I wrote any code for CE, so I can't really comment on anything current (or close to it).Charr
Y
24

TODAY, in the world of C++ on x86 processors, it is pretty safe to rely on one byte being 8 bits. Processors where the word size is not a power of 2 (8, 16, 32, 64) are very uncommon.

IT WAS NOT ALWAYS SO.

The Control Data 6600 (and its brothers) Central Processor used a 60-bit word, and could only address a word at a time. In one sense, a "byte" on a CDC 6600 was 60 bits.

The DEC-10 byte pointer hardware worked with arbitrary-size bytes. The byte pointer included the byte size in bits. I don't remember whether bytes could span word boundaries; I think they couldn't, which meant that you'd have a few wasted bits per word unless the byte size evenly divided 36 (e.g., 6, 9, 12, or 18 bits). (The DEC-10 used a 36-bit word.)
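
To give a feel for that scheme, here is a rough simulation (my own sketch, not DEC code) of extracting an arbitrary-width byte from a 36-bit word held in the low bits of a wider host integer:

    #include <cstdint>

    // Rough model of a DEC-10 style byte load: pull a 'size'-bit byte whose
    // least significant bit sits 'pos' bits above bit 0 of a 36-bit word
    // (held in the low 36 bits of a uint64_t). Position and size travel
    // with the "pointer", not with the type. Requires 0 < size <= 36.
    std::uint64_t load_byte(std::uint64_t word, unsigned pos, unsigned size) {
        const std::uint64_t mask = (std::uint64_t{1} << size) - 1;
        return (word >> pos) & mask;
    }

    // Example: with 6-bit characters packed 6 per word, load_byte(w, 30, 6)
    // yields the leftmost character.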

Younts answered 1/4, 2011 at 16:22 Comment(4)
Strings on the CDC were normally stored 10 six-bit characters to the word though, so it's much more reasonable to treat it as having a 6-bit byte (with strings normally allocated in 10-byte chunks). Of course, from a viewpoint of C or C++, a 6-bit byte isn't allowed, so you'd have had to double them up and use a 12-bit word as the "byte" (which would still work reasonably well -- the PPUs were 12-bit processors, and communication between the CPU and PPUs was done in 12-bit chunks).Charr
When I was doing 6600, during my undergrad days, characters were still only 6 bits. PASCAL programmers had to be aware of the 12-bit PP word size, though, because end-of-line only occurred at 12-bit boundaries. This meant that there might or might not be a blank after the last non-blank character in the line, and I'm getting a headache just thinking about it, over 30 years later.Younts
Holy cow what a blast from the past! +1 for the memories!Sherlocke
"TODAY, in the world of C++ on x86 processors" - You might want to talk to TI, Analog Devices (which have 16 bit DSPs), Freescale/NXP (24 bit DSPs), ARM, MIPS (both not x86), etc. In fact x86 is a minority of architectures and devices sold. But yes, a binary digital computer hardly has **trinary**(/etc.) digits.Fortress
D
15

Unless you're writing code that could be useful on a DSP, you're completely entitled to assume bytes are 8 bits. All the world may not be a VAX (or an Intel), but all the world has to communicate, share data, establish common protocols, and so on. We live in the internet age built on protocols built on octets, and any C implementation where bytes are not octets is going to have a really hard time using those protocols.

It's also worth noting that both POSIX and Windows have (and mandate) 8-bit bytes. That covers 100% of interesting non-embedded machines, and these days a large portion of non-DSP embedded systems as well.
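
That is why portable code talking to octet-based protocols tends to serialize by masking and shifting rather than by assuming anything about char width. A minimal sketch (the function names are mine):

    #include <cstdint>

    // Emit a 32-bit value as four big-endian octets. On a host where
    // CHAR_BIT > 8 each octet still occupies one unsigned char; only the
    // low 8 bits of each element are meaningful.
    void put_u32_be(std::uint32_t v, unsigned char out[4]) {
        out[0] = static_cast<unsigned char>((v >> 24) & 0xFF);
        out[1] = static_cast<unsigned char>((v >> 16) & 0xFF);
        out[2] = static_cast<unsigned char>((v >> 8) & 0xFF);
        out[3] = static_cast<unsigned char>(v & 0xFF);
    }

    std::uint32_t get_u32_be(const unsigned char in[4]) {
        return ((std::uint32_t{in[0]} & 0xFF) << 24) |
               ((std::uint32_t{in[1]} & 0xFF) << 16) |
               ((std::uint32_t{in[2]} & 0xFF) << 8) |
               (std::uint32_t{in[3]} & 0xFF);
    }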

Deepfreeze answered 8/9, 2011 at 2:23 Comment(12)
What real problems have you seen when a machine where a byte is not an octet communicates on the Internet?Genevivegenevra
I don't think that networking will be much of a problem. The very low level calls should be taking care of the encoding details. Any normal program should not be affected.Aspirator
They can't. getc and putc have to preserve unsigned char values round-trip, which means you can't just have "extra bits" in char that don't get read/written.Deepfreeze
They probably use C99's uint8_t. Since POSIX requires it for all socket functions it should be available (although C99 does not require uint8_t to be defined).Gertrudgertruda
uint8_t cannot exist if char is larger than 8 bits, because then uint8_t would have padding bits, which are not allowed.Deepfreeze
@R..: §7.20.1.1.2 (C11) says explicitly that there are no padding bits in uintN_t. §7.20.1.1.3 says "these types are optional." §3.6 defines byte as: "addressable unit of data storage large enough to hold any member of the basic character set of the execution environment" (I don't see the word "smallest" in the definition). There is a notion of internal vs. trailing padding. Can uint8_t have trailing padding? Is there a requirement that a uint8_t object be at least CHAR_BIT bits wide? (as there is with the _Bool type).Manana
@J.F.Sebastian: I have no idea where your notion of "trailing padding" came from or what it would mean. Per Representation of Types all objects have a representation which is an overlaid array unsigned char[sizeof(T)] which may consist partly of padding.Deepfreeze
@R.. Have you tried to look up the phrase "trailing padding" in the current C standard? (I did it using n1570 draft)Manana
@J.F.Sebastian: Yes. It's only used in relation to structures; it means padding after any of the members that exists as part of bringing the total struct size up to the total size it needs to be (usually just enough to increase the size to a multiple of its alignment). It has nothing to do with non-aggregate types.Deepfreeze
@R.. One thing I don't get about your "they can't [communicate on the internet]" comment is that you reference getc and putc, but are those strongly relevant to the question of accessing the internet? Doesn't almost everything in the world access the internet through interfaces outside of the standard C library? Last I checked, you couldn't even get a stdio.h compatible object pointing to a network connection without first going through system-specific interfaces, could you? So is there any reason why details of getc/etc would preclude access to the internet?Spruill
@R.. I also see no conceptual problem with a hypothetical system avoiding dropping higher bits from a char, so long as at the lowest level interface they either require data reads/writes in multiples of 8-bits at a time (2 12-bit bytes would give you a clean 3-octet boundary, 8 9-bit bytes would give you a clean 9-octet boundary, etc) or buffered accordingly. I suspect such systems do not exist due to lack of demand, but is there any reason why it'd be an impossibility?Spruill
Lest my comments give the opposite impression, though, +1 to this answer, because it explains why it's reasonable to expect an 8-bit byte in all of the circumstances when it is, which really helps answering the fundamental question of "when should I worry about bytes/char being anything other than an octet" in a way that mere examples can't.Spruill
S
7

From Wikipedia:

The size of a byte was at first selected to be a multiple of existing teletypewriter codes, particularly the 6-bit codes used by the U.S. Army (Fieldata) and Navy. In 1963, to end the use of incompatible teleprinter codes by different branches of the U.S. government, ASCII, a 7-bit code, was adopted as a Federal Information Processing Standard, making 6-bit bytes commercially obsolete. In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the 8-bit µ-law encoding. This large investment promised to reduce transmission costs for 8-bit data. The use of 8-bit codes for digital telephony also caused 8-bit data "octets" to be adopted as the basic data unit of the early Internet.

Succentor answered 1/4, 2011 at 16:20 Comment(1)
This is not an answer to the question, just a vaguely related historical note.Centiare
B
6

As an average programmer on mainstream platforms, you do not need to worry too much about one byte not being 8 bits. However, I'd still use the CHAR_BIT constant in my code and assert (or better static_assert) at any location where you rely on 8-bit bytes. That should put you on the safe side.

(I am not aware of any relevant platform where it doesn't hold true).
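
A minimal sketch of the static_assert approach mentioned above (C++11 or later):

    #include <climits>

    // Fails to compile anywhere a byte is not exactly 8 bits wide,
    // documenting the assumption right where it matters.
    static_assert(CHAR_BIT == 8, "this code relies on 8-bit bytes");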

Bystreet answered 1/4, 2011 at 16:20 Comment(2)
Besides being safe, CHAR_BIT is self-documenting. And I learned on SO that some embedded platforms apparently have 16-bit char.Ditzel
I realize that CHAR_BIT is meant to represent the byte size, but the beef I have with that term is that it really has less to do with chars and more to do with byte length. A newbie dev will likely read CHAR_BIT and think it has something to do with using UTF8 or something like that. It's an unfortunate piece of legacy IMO.Mccreery
H
4

Firstly, the number of bits in char does not formally depend on the "system" or on "machine", even though this dependency is usually implied by common sense. The number of bits in char depends only on the implementation (i.e. on the compiler). There's no problem implementing a compiler that will have more than 8 bits in char for any "ordinary" system or machine.

Secondly, there are several embedded platforms where sizeof(char) == sizeof(short) == sizeof(int), each having 16 bits (I don't remember the exact names of these platforms). Also, the well-known Cray machines had similar properties, with all these types having 32 bits in them.
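
If you want to see what a particular implementation does, a quick check along these lines (just a sketch) prints the relevant numbers:

    #include <climits>
    #include <cstdio>

    int main() {
        // sizeof is measured in bytes (i.e. chars), so multiplying by
        // CHAR_BIT gives widths in bits for this particular implementation.
        std::printf("CHAR_BIT = %d\n", CHAR_BIT);
        std::printf("short: %zu bits, int: %zu bits, long: %zu bits\n",
                    sizeof(short) * CHAR_BIT,
                    sizeof(int) * CHAR_BIT,
                    sizeof(long) * CHAR_BIT);
    }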

Haney answered 19/6, 2011 at 15:56 Comment(8)
While you can technically do anything you want when implementing a compiler, in a practical sense you need to conform to the operating system's ABI, and this generally forces all compilers for a particular system to use the same data representations.Expatiate
@Barmar: The need to conform to the operating systems ABI applies to interface data formats only. It does not impose any limitations onto the internal data formats of the implementation. The conformance can be (and typically is) achieved by using properly selected (and possible non-standard) types to describe the interface. For example, boolean type of Windows API (hiding behind BOOL) is different from bool of C++ or C. That does not create any problems for implementations.Haney
Many APIs and ABIs are specified in terms of standard C data types, rather than abstract types. POSIX has some abstract types (e.g. size_t), but makes pretty liberal use of char and int as well. The ABI for particular POSIX implementations must then specify how these are represented so that interfaces will be compatible across implementations (you aren't required to compiler applications with the same implementation as the OS).Expatiate
@Barmar: That is purely superficial. It is not possible to specify ABIs in terms of truly standard language-level types. Standard types are flexible by definition, while ABI interface types are frozen. If some ABI uses standard type names in its specification, it implies (and usually explicitly states) that these types are required to have some specific frozen representation. Writing header files in terms of standard types for such ABIs will only work for those specific implementation that adhere to the required data format.Haney
Note that for the actual implementation "ABI in terms of standard types" will simply mean that some header files are written in terms of standard types. However, this does not in any way preclude the implementation from changing the representation of standard types. The implementation just has to remember that those header files have to be rewritten in terms of some other types (standard or not) to preserve binary compatibility.Haney
For example, today I specify some ABI in terms of type int and presume (and explicitly state) that in this ABI int has 32 bits. Tomorrow my compiler gets significantly upgraded and its int changes from 32 bits to 64 bits. To preserve binary compatibility all I have to do in this case is replace int with int32_t in that ABI's header files. I don't even have to change the documentation of the ABI, since it explicitly states that it expects 32-bit int.Haney
@AndreyT: not in C++, you can't ...Waft
@SamB: Huh? I "can't" what exactly?Haney
B
2

I do a lot of embedded work and am currently working on DSP code with a CHAR_BIT of 16.

Bumboat answered 1/4, 2011 at 16:25 Comment(1)
Yes, and there are still a few 24-bit DSPs around.Younts
N
2

Historically, there have been a bunch of odd architectures that did not use native word sizes that were multiples of 8. If you ever come across any of these today, let me know.

  • The first commercial CPU by Intel was the Intel 4004 (4-bit)
  • PDP-8 (12-bit)

The size of the byte has historically been hardware dependent and no definitive standards exist that mandate the size.

It might just be a good thing to keep in mind if you're doing lots of embedded stuff.

Nettie answered 1/4, 2011 at 16:35 Comment(0)
F
1

Adding one more as a reference, from the Wikipedia entry on the HP Saturn:

The Saturn architecture is nibble-based; that is, the core unit of data is 4 bits, which can hold one binary-coded decimal (BCD) digit.
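
For anyone unfamiliar with BCD nibbles, a rough illustration (my own sketch, not Saturn code) of packing two decimal digits into the two 4-bit halves of an 8-bit byte:

    #include <cstdint>

    // Pack two decimal digits (0-9) into one 8-bit byte as BCD nibbles,
    // high digit in the upper four bits, and unpack them again.
    std::uint8_t bcd_pack(unsigned hi, unsigned lo) {
        return static_cast<std::uint8_t>(((hi & 0x0F) << 4) | (lo & 0x0F));
    }

    void bcd_unpack(std::uint8_t b, unsigned& hi, unsigned& lo) {
        hi = b >> 4;
        lo = b & 0x0F;
    }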

Formerly answered 11/9, 2013 at 14:11 Comment(0)
