Why are C character literals ints instead of chars?
Asked Answered

10

121

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.

In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.

Neuroblast answered 11/1, 2009 at 22:43 Comment(6)
sizeof would just return the size of a byte wouldn't it? Aren't a char and an int equal in size?Nancee
This is probably compiler (and architecture) dependent. Care to say what you're using? The standard (at least up to '89) was very loose.Selinski
no. a char is always 1 byte, so sizeof('a') == 1 always (in C++), while an int can in theory have a sizeof of 1, but that would require a byte of at least 16 bits, which is very unlikely :) so sizeof('a') != sizeof(int) in most C++ implementationsQuadruplex
... while it's always wrong in C.Quadruplex
'a' is an int in C - period. C got there first - C made the rules. C++ changed the rules. You can argue that the C++ rules make more sense, but changing the C rules would do more damage than good, so the C standard committee wisely hasn't touched this.Tearle
Jonathan, just to be clear - my "it's always wrong in C" isn't meant to say C is always wrong :) it means that sizeof('a') == sizeof(int) is always true in C . your comment sounds like you comment on something i said in my comment :)Quadruplex
39

Discussion on same subject

"More specifically the integral promotions. In K&R C it was virtually (?) impossible to use a character value without it being promoted to int first, so making character constant int in the first place eliminated that step. There were and still are multi character constants such as 'abcd' or however many will fit in an int."

Sike answered 11/1, 2009 at 23:21 Comment(3)
Multi-character constants are not portable, even between compilers on a single machine (though GCC seems to be self-consistent across platforms). See: stackoverflow.com/questions/328215Tearle
I would note that a) This quotation is unattributed; the citation merely says "Would you disagree with this opinion, which was posted in a past thread discussing the issue in question?" ... and b) It is ludicrous, because a char variable is not an int, so making a character constant be one is a special case. And it's easy to use a character value without promoting it: c1 = c2;. OTOH, c1 = 'x' is a downward conversion. Most importantly, sizeof(char) != sizeof('x'), which is serious language botch. As for multibyte character constants: they're the reason, but they're obsolete.Verdure
Related: What do we do with answers that are entirely copied and improperly attributed (only a "reference" link or similar is included)?Stylo
36

The original question is "why?"

The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.

In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.

This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.

This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.

When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.

When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.

This is why they are different. Evolution...

Tenancy answered 23/4, 2014 at 16:4 Comment(2)
+1 from me for actually answering 'why?'. But I disagree with the last statement -- "The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures" -- in C++ it is still possible for 2 functions to have parameters of same size and different signatures, e.g. void f(unsigned char) Vs void f(signed char).Dogmatize
@PeterK John could have put it better, but what he says is essentially accurate. The motivation for the change in C++ was, if you write f('a'), you probably want overload resolution to choose f(char) for that call rather than f(int). The relative sizes of int and char are not relevant, as you say.Whet
23

I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:

void print(int);
void print(char);

print('a');

You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.

Quadruplex answered 11/1, 2009 at 23:26 Comment(2)
Yes, "Design and Evolution of C++" says overloaded input/output routines were the main reason C++ changed the rules.Fudge
Max, yeah i cheated. i looked in the standard in the compatibility section :)Quadruplex
19

Using GCC on my MacBook, I try:

#include <stdio.h>

#define test(A) do{printf(#A":\t%zu\n",sizeof(A));}while(0) /* %zu: sizeof yields size_t */
int main(void){
  test('a');
  test("a");
  test("");
  test(char);
  test(short);
  test(int);
  test(long);
  test((char)0x0);
  test((short)0x0);
  test((int)0x0);
  test((long)0x0);
  return 0;
}

which when run gives:

'a':    4
"a":    2
"":     1
char:   1
short:  2
int:    4
long:   4
(char)0x0:      1
(short)0x0:     2
(int)0x0:       4
(long)0x0:      4

which suggests that a character is 8 bits, like you suspect, but a character literal is an int.

Selinski answered 11/1, 2009 at 23:8 Comment(2)
+1 for being interesting. People often think that sizeof("a") and sizeof("") are char*'s and should give 4 (or 8). But in fact they're char[]'s at that point (sizeof(char[11]) gives 11). A trap for newbies.Kwangchow
A character literal is not promoted to an int, it is already an int. There is no promotion going on whatsoever if the object is an operand of the sizeof operator. If there was, this would defeat sizeof's purpose.Easley
8

Back when C was being written, the PDP-11's MACRO-11 assembly language had:

MOV #'A, R0      // 8-bit character encoding for 'A' into 16 bit register

This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:

MOV #"AB, R0     // 16-bit character encoding for 'A' (low byte) and 'B'

This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.

So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:

address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'

If you want to read just an 'A' from this main memory into a register, which one would you read?

  • Some CPUs may only directly support reading a 16 bit value into a 16-bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.

  • Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.

So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0—depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).

My guess is that C simply carried this level of CPU-centric behaviour over, thinking of character constants as occupying register-sized slots of memory, bearing out the common assessment of C as a "high-level assembler".

(See 6.3.3 on page 6-25 of PDP-11 MACRO-11 Language Reference Manual)

Bidentate answered 29/3, 2011 at 6:26 Comment(0)
5

I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. The code put the read character into an int, tested for EOF, and converted to a char if it wasn't.

I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.

int r;
char buffer[1024], *p = buffer;

/* Stop at EOF or when the buffer is full, so we can't overflow it. */
while (p < buffer + sizeof buffer && (r = getc(file)) != EOF)
{
  *(p++) = (char) r;
}
Impolitic answered 11/1, 2009 at 22:51 Comment(6)
I don't think 0 is a valid character though.Olympiaolympiad
@gbjbaanb: Sure it is. It's the null character. Think about it. Do you think a file shouldn't be allowed to contain any zero bytes?Cloison
A null-terminated file might make sense for textual data, but if it's binary I think \0 should be considered a valid value.Impolitic
Read wikipedia - "The actual value of EOF is a system-dependent negative number, commonly -1, which is guaranteed to be unequal to any valid character code."Sike
As Malx says - EOF is not a char type - it's an int type. getchar() and friends return an int, which can hold any char as well as EOF without conflict. This would really not require literal chars to have type int.Comfort
EOF == -1 came long after C's character constants, so this is not an answer and not even relevant.Verdure
5

I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):

In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.

So for the most part, it shouldn't cause any problems.

Comfort answered 11/1, 2009 at 23:53 Comment(1)
Interesting! Kinda contradicts what others were saying about how the C standards committee "wisely" decided not to remove this quirk from C.Lauretta
3

The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 7-bit ASCII but could only perform arithmetic on registers. (B predates the PDP-11; C itself was first developed on it.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.

That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use ones'-complement math instead of two's-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hexadecimal is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks, but not four-bit nibbles.

Oringa answered 10/7, 2018 at 5:14 Comment(1)
... and char was exactly 3 octal digits longExcavation
-1

This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).

Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start out with type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:

Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:

Every char in an expression is converted into an int....Notice that all float's in an expression are converted to double....Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.

Tullusus answered 11/1, 2009 at 23:45 Comment(8)
If the other comments are to be believed, the expression 'a' starts out with type int -- no type promotion is performed inside of a sizeof(). That 'a' has type int is just a quirk of C it seems.Lauretta
A char literal does have type int. The ANSI/ISO 99 standard calls them 'integer character constants' (to differentiate them from 'wide character constants', which have type wchar_t) and specifically says, "An integer character constant has type int."Comfort
What I meant was that it does not start with type int, but rather converted to an int from char (answer edited). Of course, this probably does not concern anyone except compiler writers since the conversion is always done.Tullusus
No! If you read the ANSI/ISO 99 C standard you will find that in C, the expression 'a' starts with type int. If you have a function void f(int) and a variable char c, then f(c) will perform integral promotion, but f('a') won't as the type of 'a' is already int. Strange but true.Lauretta
Unfortunately I don't have access to the standard. Anyway, C99 was after K&R 1, so I can only assume that was one of the silent changes. It makes no difference to programmers (even compiler writers) anyway.Tullusus
The K&R quote is being misinterpreted. The character literal 'a' never had type char, it was always int, according to every C standard.Plication
"Just to be sure" -- You could be more sure by actually reading the statement: "Character literals have type int". "I can only assume that was one of the silent changes" -- you assume wrongly. Character literals in C have always been of type int.Verdure
-1 This answer is still incorrect and should get deleted. See C11 6.4.4.4/10: "An integer character constant has type int". Either "Deep Secrets" is incorrect or you just misunderstood it.Cathicathie
-1

This is only tangential to the language specification, but in hardware the CPU usually only has one register size—32 bits, let's say—and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register.

The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.

It's sort of an academic point, because the language could have specified an 8-bit literal type anyway, but in this case the language specification happens to reflect more closely what the CPU is really doing.

(x86 wonks may note that there are, e.g., native adds on the 16-bit subregisters that operate in one step, but internally this still tends to decompose into add-then-sign-extend, like an add/extsh pair on the PowerPC.)

Mouthwash answered 11/1, 2009 at 23:54 Comment(6)
Yet another wrong answer. The issue here is why character literals and char variables have different types. Automatic promotions, which reflect the hardware, aren't relevant -- they're actually anti-relevant, because char variables are automatically promoted so that's no reason for character literals not to be of type char. The real reason is multibyte literals, which are now obsolete.Verdure
@Jim Balter Multibyte literals aren't obsolete at all; there's multibyte Unicode and UTF characters.Mouthwash
@Mouthwash We're talking about multibyte character literals, not multibyte string literals. Do try to pay attention.Verdure
Chrashworks did write characters. You should have written that wide character literals (say L'à') do take more bytes but are not called multibyte char literals. Being less arrogant would help you to be more accurate yourself.Vina
@Vina Wide character literals aren't relevant here -- they have nothing to do with what I wrote. I was accurate and you lack comprehension and your bogus attempt to correct me is what's arrogant.Verdure
Two questions: 1) Was the mention of UTF referring to UTF-8? 2) Is it impossible/obsolete to encode a single non-ASCII UTF-8 character as a multibyte character literal? If the answer to both questions is yes, as it seems (can't find definitive answers on either), I owe you some apologies for questioning your accuracy.Vina

© 2022 - 2024 — McMap. All rights reserved.