Confusing behavior of sizeof with chars [duplicate]
#include <stdio.h>
#include <string.h>

int main(void)
{
    char ch='a';

    printf("sizeof(ch)          = %d\n", sizeof(ch));
    printf("sizeof('a')         = %d\n", sizeof('a'));
    printf("sizeof('a'+'b'+'C') = %d\n", sizeof('a'+'b'+'C'));
    printf("sizeof(\"a\")       = %d\n", sizeof("a"));
}

This program uses sizeof to calculate sizes. Why is the size of 'a' different from the size of ch (where ch='a')?

sizeof(ch)          = 1
sizeof('a')         = 4
sizeof('a'+'b'+'C') = 4
sizeof("a")         = 2
Isopropyl answered 4/7, 2018 at 12:41 Comment(13)
You should be using %zu as sizeof returns size_t not intAfghanistan
You need to tag this either C or C++, because this code will give very different answers depending on the language. Basically, C++ recognized that C was being stupid and fixed various obvious language flaws, while C refuses to admit that it is stupid.Ambros
@Sam Varshavchik Not necessarily a dupe because the first two rows will give 1 vs 4 in C, but 1 vs 1 in C++. The 3rd row will indeed mess around with implicit promotion in C++, but not in C.Ambros
Odd interpretation of the word "duplicate" here. I've reopened. Disk is cheap. Search engines are powerful. Let's only close as duplicates if it's a duplicate.Budgerigar
in that case, maybe the question needs upvoting.Defraud
@Budgerigar "Disk is cheap" is a non-reason...Revivalism
@user202729: In your opinion, with respect. When researching, it's always good to have a selection of sources. This quixotic closing to broad so-called duplicates is the thing that makes no sense.Budgerigar
I can find 3 partial duplicate targets, each answering a part of the question. I am flagging to close as too broad.Revivalism
@Revivalism It’s not too broad. It asks a very specific, real-world question about software engineering. And it does not appear to have an exact duplicate.Saenz
@Saenz If it can be split into 3 different questions (each of which is on-topic for Stack Overflow), it's too broad.Revivalism
Also, it doesn't have an exact duplicate precisely because it's too broad, in this case.Revivalism
I don't get why this is a duplicate. The answer to this question is "Because C character literals are ints". "Why are C character literals ints" is a different question, which I cannot ask before I know that C character literals are ints, right? The second question implies that you already know the answer to the first question. But you don't.Hourigan
@Hourigan I searched a lot before asking this question. I ain't sure what the two users felt before marking this as a duplicate.Isopropyl

TL;DR - sizeof works on the type of the operand.

  • sizeof(ch)          == sizeof(char)     ----- (1)
  • sizeof('a')         == sizeof(int)      ----- (2)
  • sizeof('a'+'b'+'C') == sizeof(int)      ----- (3)
  • sizeof("a")         == sizeof(char[2])  ----- (4)

Let's see each case now.

  1. ch is defined to be of char type, so this one is straightforward.

  2. In C, sizeof('a') is the same as sizeof (int), as a character constant has type int.

    Quoting C11,

    An integer character constant has type int. [...]

    In C++, a character literal has type char.

  3. sizeof is a compile-time operator (except when the operand is a VLA), so the type of the expression is used. As noted earlier, all integer character constants have type int, so int + int + int produces int. The type of the operand is therefore taken as int.

  4. "a" is an array of two chars, 'a' and 0 (null-terminator) (no, it does not decay to pointer to the first element of the array type), hence the size is the same as of an array with two char elements.


Finally, note that sizeof produces a result of type size_t, so you must use the %zu format specifier to print the result.

Wicklow answered 4/7, 2018 at 12:43 Comment(10)
I wonder in what problems, if any, would have resulted from making sizeof operate only on lvalues? Especially on systems that use FLT_EVAL_METHOD==2 and have a long double type which is bigger than double, it would seem a bit weird to suggest that sizeof (1.0/10.0) should report 8 even if long double d = (1.0/10.0); would store a value that cannot be represented in an 8-byte "double".Fusty
Suggest "character constant has type integer" --> "character constant has type int".Thompkins
@Fusty Among other things it would probably have made the following construct illegal: sizeof(type).Trimetallic
@dgnuff: Mea culpa. What I meant was excluding the operator on values that aren't lvalues.Fusty
@supercat, relatively few problems, I suspect. A bit of cognitive dissonance, for one: it is nicely consistent that sizeof works on all expressions. As for practical programming issues, the one I see is the case where you want the size of a string literal, and especially where that literal is conveyed via a macro, so that its size may be changed at some location distant from its use -- maybe even on the compiler command line.Millicentmillie
@JohnBollinger: From what I can tell, what makes sizeof useful with string literals is that they are char[] const lvalues (which would be allowed if sizeof were restricted to lvalues) rather than char const* values. As for having it work on "all" expressions, the & operator doesn't, so why should sizeof be special in that regard?Fusty
You're right, @supercat, a string literal is an lvalue, so that's not a problem. As for working with all expressions, however, I submit that many C operators work with any operand of suitable type. Among the unary operators, for example, there are -, !, and ~, and even the function-call operator, (). C requires operands to be lvalues only where it needs to refer to storage, which, of course, is what distinguishes lvalues from non-lvalue expressions. All expressions have sizes as determined by their types, whether or not they have any associated storage.Millicentmillie
@JohnBollinger: Within a function like void foo(someArrayType arr); the expression arr is clearly a value, but attempting to use sizeof on that type is unlikely to yield an intended result. IMHO, a clean way of preventing such nonsense would be to say that within such a function, the expression arr would not be an lvalue, but would instead yield a value of pointer-to-member type, but only if sizeof required an lvalue.Fusty
@supercat, I take your point, but I observe that C says that there are no functions with parameters of array type, because declarations that have a form that would declare other identifiers as arrays declare function parameters as pointers, instead. I think I understand why that decision was made, but I'd say that the nonsense in this area is in allowing such a deceptive form of declaration in the first place, not in handling the resulting parameters according to the type with which they are (actually) declared.Millicentmillie
@JohnBollinger: If a function's parameter is declared as being an array type, it may not be necessary to have the parameter "pre-decomposed" into a pointer, rather than having it be an array lvalue which gets converted into a pointer value in many contexts, but treating it as an lvalue of pointer type was an "unforced error".Fusty

In C, 'a' is a constant of type int. It is not a char. So sizeof('a') will be the same as sizeof(int).

sizeof(ch) is the same as sizeof(char). (The C standard guarantees that all alphanumeric constants - and some others - of the form 'a' can fit into a char, so char ch='a'; is always well-defined.)

Note that in C++, 'a' is a literal of type char; yet another difference between C and C++.

In C, sizeof("a") is sizeof(char[2]) which is 2. sizeof does not instigate the decay of an array type to a pointer.

In C++, sizeof("a") is sizeof(const char[2]) which is 2. sizeof does not instigate the decay of an array type to a pointer.

In both languages, 'a'+'b'+'C' has type int: in C because each character constant is already an int, and in C++ because of implicit promotion of integral types.

Budgerigar answered 4/7, 2018 at 12:43 Comment(3)
Great answer but for the very minor issue of not being explicit about 'a'+'b'+'C' being an example of integral promotion, not integral conversion, in standard terms. (Both are conversions though, because this is also used as an umbrella term. The naming is… interesting.)Burgas
@ArneVogel: Thank you, if I had a dollar every time I say or write that incorrectly...Budgerigar
@chux Thanks, I’ve fixed but I think I’ll leave all the C++ stuff up - the joys of a moving question!Budgerigar

First of all, the result of sizeof has type size_t, which should be printed with the %zu format specifier. Ignoring that part and assuming int is 4 bytes, then

  • printf("sizeof(ch) %d\n",sizeof(ch)); will print 1 in C and 1 in C++.

    This is because a char is by definition guaranteed to be 1 byte in both languages.

  • printf("sizeof('a') %d\n",sizeof('a')); will print 4 in C and 1 in C++.

    This is because character literals are of type int in C, for historical reasons1), but they are of type char in C++, because that's what common sense (and ISO 14882) dictates.

  • printf("sizeof('a'+'b'+'C') %d\n",sizeof('a'+'b'+'C')); will print 4 in both languages.

    In C, the resulting type of int + int + int is naturally int. In C++, we have char + char + char, but + invokes the implicit type promotion rules, so we end up with int either way.

  • printf("sizeof(\"a\") %d\n",sizeof("a")); will print 2 in both languages.

    The string literal "a" is of type char[] in C and const char[] in C++. In either case we have an array consisting of an a and a null terminator: two characters.

    As a side note, this happens because the array "a" does not decay into a pointer to the first element when operand to sizeof. Should we provoke an array decay by for example writing sizeof("a"+0), then we would get the size of a pointer instead (likely 4 or 8).


1) Somewhere in the dark ages there were no types and everything you wrote would boil down to int no matter what. Then when Dennis Ritchie started to cook together some manner of de facto standard for C, he apparently decided that character literals should always be promoted to int. And later, when C was standardized, they said that character literals are simply int.

Upon creating C++, Bjarne Stroustrup recognized that all of this didn't make much sense and gave character literals type char, as they ought to be. But the C committee stubbornly refuses to fix this language flaw.

Ambros answered 4/7, 2018 at 13:13 Comment(12)
My copies of the C89 and C99 standard define sizeof to return counts of "storage units", not "bytes", whatever those are.Metallist
@EricTowers "byte" is today typically only used for 8 bits, but sizeof returns the number of chars - and a C char can be larger than 8 bits (it's 16 bits on a CPU I'm working with, for example).Arrivederci
@Arrivederci : And a byte has been 9-bits on architectures I've worked on. My point is that, since the standard does not define or use "byte"s it is incorrect to have "is per definition guaranteed to be 1 byte".Metallist
Detail: C standard does not have character literals. It does have character constants which are type int. C's 2 literals: string and compound can have their address taken, unlike constants.Thompkins
@EricTowers C11/C99 6.5.3.4/2 or C90 6.3.3.4 "The sizeof operator yields the size (in bytes) of its operand". Maybe cite the standard next time before making up such statements.Ambros
ISO/IEC 9899:1990 3.6. Summarizing: "bytes" != bytes. For more on this discrepancy, see misra.org.uk/forum/viewtopic.php?t=973Metallist
@EricTowers Yes I am well-aware that the standard allows a byte to be something else than 8 bits. Nothing in this answer contradicts that. As proven by quoting normative text in the 3 latest C standards, the sizeof operator returns the size in bytes. I have only ever spoken about bytes.Ambros
@Ambros : You have not spoken about bytes. You have spoken about "bytes". And as cited, using Standard meanings of plain language words in semantically conflatable settings is misleading.Metallist
@EricTowers What it boils down to is that anyone designing C programs for compatibility with wildly exotic DSP:s are wasting their time almost as much as people writing pedantic comments on internet sites along the lines of: "but a byte might be 57 bits!", "but an int might have padding bits and there will be trap representations!", "but this system might be a 33 bit CPU signed magnitude computer!" etc. Sure the standard allows it, but wasting energy caring about it is a huge waste of everyone's time. Focus on portability to mainstream computers.Ambros
What it boils down to is anyone reading the term "bytes" in your Answer without comment that the Standard doesn't mean bytes will think you mean bytes. While you say you are aware of this defect of the Standard, you are apparently unaware of the widespread confusion on the issue. It's related to your undocumented claim "Then when Dennis Ritchie started to cook together some manner of de facto standard for C, he apparently decided that character literals should always be promoted to int.", which is in direct contrast with the documented reason : the PDP-11 had no 8-bit GP register.Metallist
@Lundin: Unfortunately, the authors of the C Standard seem opposed to the idea of recognizing mainstream compilers and platforms, and would rather saddle the 99% of programs that nobody would ever have any interest in running on anything other than octet-based linear-address architectures with two's-complement silent-wraparound integer semantics, with the limitations of quirky architectures which in some cases might not even exist [e.g. those where left-shifting a negative number would do anything other than multiply be a power of two in cases where such a multiply would not overflow].Fusty
When C was first designed, there were only three kinds of values: pointers, integers, and double-precision floating-point. Evaluating an object would promote its value to the largest type of the appropriate kind. The type of a character literal value had to be int because there was no such thing as a value of char type.Fusty

As others have mentioned, the C language standard defines the type of a character constant to be int. The historical reason for this is that C, and its predecessor B, were originally developed on DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.

That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one's-complement math instead of two's-complement for similar historical reasons, and the reason character escapes default to octal (octal constants start with just 0, while hex needs \x or 0x) is that those early DEC minicomputers had word sizes divisible into three-bit groups but not four-bit nibbles.

Automatic promotion to int causes nothing but trouble today. (How many programmers are aware that multiplying two uint32_t expressions together is undefined behavior, because some implementations define int as 64 bits wide, the language requires that any type of lower rank than int must be promoted to a signed int, the result of multiplying two int multiplicands has type int, the multiplication can overflow a signed 64-bit product, and this is undefined behavior?) But that’s the reason C and C++ are stuck with it.

Saenz answered 4/7, 2018 at 23:52 Comment(11)
Thanks. Good research, though.Isopropyl
Note that the authors of the Standard have expressly recognized the possibility that an implementation may be conforming and yet be of such poor quality as to be useless, but assume that quality implementations won't go out of their way to behave in the least-useful fashion the Standard would permit. The Rationales for all versions of the Standard describe expressions where they would expect quality commonplace implementations to treat signed and unsigned math identically. The UB resulting from unwanted promotions to signed types will only be a problem when using low-quality compilers...Fusty
...(which, for whatever reason, programmers have become all too willing to tolerate). The fact that a piece of code won't work on a compiler that designed to be of needlessly-poor quality doesn't mean the code is broken. It would be impossible to write any program that couldn't be sunk by a "conforming implementation" of sufficiently poor quality.Fusty
@Fusty It was your answer about how C is not a “safe” programming language that brought that example to mind. :)Saenz
@Fusty I agree that a lot of language-lawyering isn’t especially relevant to coding today. Sometimes, for fun, I point out loopholes based on the fact that one’s-complement or sign-and-magnitude arithmetic are still technically allowed. But they’re only used in a few mainframe architectures from the ’60s (although UNIVAC does still support one of those). Or that there is an implementation that supports EBCDIC as the source and execution character setSaenz
@Davislor: If the Standard recognized a concept of a "limited implementation" which can't support all features, but will reject programs that require or may require features it can't support, then a C99 implementation could be practical on the Univac. I don't think there's any practical way a ones'-complement or sign-magnitude machine can efficiently handle a uint64_least_t or unsigned long long type without also being able to efficiently process two's-complement arithmetic unless its basic word size was 65 bits or longer.Fusty
@Fusty It does recognize a distinction between a hosted and freestanding implementation, but yes. Emulating higher-precision math would be difficult, and you couldn’t have the exact-width types we were talking about anyway, because padding bits are not allowed.Saenz
@Davislor: The latest Univac C implementation I've read about supported a 72-bit "long long" type, but not an unsigned equivalent. The documentation didn't say how the "long long" was stored, but I would guess it probably used a non-binary representation with the upper word being (2**36-1) times the lower. Such an approach would be allowable for an extended signed integer type, but would not be allowable for an unsigned type.Fusty
@Fusty Interesting! I did not know that. But we’re getting off-topic.Saenz
@Davislor: My intended point was that the requirement that implementations must support a 64-bit unsigned data type made it impractical to produce a meaningfully-conforming C99 implementation on any existing sign-magnitude or ones'-complement hardware. As a consequence, concessions elsewhere in the Standard which were made to accommodate such machines serve no useful purpose unless weakening the language for no apparent reason is considered an "useful purpose".Fusty
Let us continue this discussion in chat.Saenz

I'm assuming the code was compiled in C.
In C, 'a' is treated as an int type and int has a size of 4. In C++, 'a' is treated as a char type, and if you try compiling your code in cpp.sh, it should return 1.

Punchboard answered 4/7, 2018 at 12:53 Comment(6)
"int has a size of 4" Usually, yes. But not always.Afghanistan
I have a platform where sizeof(int)==1.Arrivederci
@Arrivederci What platform/compiler is that with sizeof(int)==1?Thompkins
@chux: I believe Cray XMP used CHAR_BIT == 32 so sizeof(int) == 1. Most people don't have one of those kicking around any more — or the power or water supply necessary to keep it happy.Heidi
@chux A custom core in a small embedded chip where CHAR_BIT==16 and an int is 16 bits.Arrivederci
@chux: I've written code for a DSP where a char is a 16-bit signed integer type, and int is likewise. The only way the hardware could support changing an octet in memory would be to do a 16-bit load, change 8 bits of the loaded value, and then do a 16-bit store. It may have been possible for a compiler to generate code to process character-type writes with a read-modify-write sequence, but that would have made code that operates on a sequence of character-type values really slow.Fusty
