ASCII strings and endianness
L

12

54

An intern who works with me showed me an exam he had taken in computer science about endianness issues. There was a question that showed an ASCII string "My-Pizza", and the student had to show how that string would be represented in memory on a little endian computer. Of course, this sounds like a trick question because ASCII strings are not affected by endian issues.

But shockingly, the intern claims his professor insists that the string would be represented as:

P-yM azzi

I know this can't be right. There is no way an ASCII string would be represented like that on any machine. But apparently, the professor is insisting on this. So, I wrote up a small C program and told the intern to give it to his professor.

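#include <string.h>
#include <stdio.h>

int main(void)
{
    const char* s = "My-Pizza";
    size_t length = strlen(s);
    /* Walk the string by increasing address and print each byte where it lives. */
    for (const char* it = s; it < s + length; ++it) {
        printf("%p : %c\n", (void*)it, *it);  /* %p expects a void pointer */
    }
    return 0;
}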

This clearly demonstrates that the string is stored as "My-Pizza" in memory. A day later, the intern gets back to me and tells me the professor is now claiming that C is automagically converting the addresses to display the string in proper order.

I told him his professor is insane, and this is clearly wrong. But just to check my own sanity here, I decided to post this on Stack Overflow so I could get others to confirm what I'm saying.

So, I ask: who is right here?

Litotes answered 14/10, 2009 at 18:12 Comment(16)
Do you have access to a debugger to show the prof? Is this linux or windows?Willock
Sure. The same thing could be demonstrated using gdb on linux, by examining each byte in memoryLitotes
No need for a debugger: the OP's (well-played) use of the %p format specifier tells you all you really need to know.Joanajoane
Though that strlen() in a for() loop conditional makes me cringe.Joanajoane
SERIOUSLY? Who is this guy? (and +1 for Chris).Valleau
Mr. Lutz -- Aware of the %p, I felt that it would not be enough for the professor in question. After all, the professor already feels that the ++ operator does something clever with char * to "jump around"; it might as well also somehow renumber itself when passed to printf(). A debugger being another implementation, and language-agnostic, I thought it might educate the prof. ;)Willock
Check the assembler & write your own assembler routine. Also...I hope never to meet that prof.Scarlet
I don't suppose you'd care to name this professor.Tamera
Although it doesn't matter in this question, I removed the strlen call from the loop so that fewer people write like that when coming for an interview.Fradin
Perhaps I'm giving the prof too much credit but the fact that the "humanized output explanation" didn't occur to anyone makes me think that SO has really botched the answer to this one...Patentor
@Ross, I think you're missing the point; this professor claims for some reason that endianness issues (which by definition only affect types larger than 8 bits) are affecting 8-bit data. Can you explain what you think he's trying to say?Valleau
I just gave an example in my answer. Sure, I can't read the prof's mind, but the fact that this alternate explanation didn't even occur to people is .. a concern.Patentor
Another explanation is that both the prof and the SO crowd are "not quite getting it". If the prof is wrong, it should have at least occurred to you guys why he might have been wrong. I still think it's just a representation issue. I guess we would have to interview the prof to know for sure.Patentor
$ cat > /tmp/pizza My-Pizza$ $ od -X /tmp/pizza 0000000 502d794d 617a7a69 0000010 $ For the record, y == 79, M == 4d. Get the point?Patentor
Yeah I see that in your answer below, but I don't see how you can get that interpretation from the question. It seems pretty clear to me what's going on. If the prof were wrong and cleared it up instead of trying to perpetuate it with his day-later response, I think that would be a different story.Valleau
@Ross, you're conflating how the string can be represented in a certain format, versus how it is actually stored in memory, which is the issue here. By your logic, a Spanish translation of the string would also be a valid "representation" because it is one way a particular application may "interpret" the string.Litotes
W
36

Without a doubt, you are correct.

ANSI C standard 6.1.4 specifies that string literals are stored in memory by "concatenating" the characters in the literal.

ANSI standard 6.3.6 also specifies the effect of addition on a pointer value:

When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression.

If the idea attributed to this person were correct, then the compiler would also have to monkey around with integer math when the integers are used as array indices. Many other fallacies would also result which are left to the imagination.

The person may be confused because (unlike a string initializer) multi-byte character constants such as 'ABCD' have type int, so their bytes are stored in the machine's endian order.

There are many reasons a person might be confused about this. As others have suggested here, he may be misreading what he sees in a debugger window, where the contents have been byte-swapped for readability of int values.

Willock answered 14/10, 2009 at 18:25 Comment(6)
It may be that the professor is looking at memory in his debugger in a 32-bit mode and is confused by the endianness?Valleau
This is all just a communication gap due to so few people having seen an actual dump and the fact that no one here recognizes that you have to print one thousand as 1,000, not 000,1. This totally wrong answer has 8 votes from equally confused readers...Patentor
@DigitalRoss. Listen, Ross, I don't appreciate your comment. I have been reading dumps for 29 years at this point. My answer is totally correct. Witness to this fact is your inability to explicate any specific to the contrary. Or: please do explain yourself.Willock
The confusion lies in understanding the byte order within words of a little endian machine. ASCII characters will be stored in successive bytes, but bytes are physically ordered "backwards" in memory itself. Please see my answer for references and pictures https://mcmap.net/q/183763/-ascii-strings-and-endiannessAnglaangle
@Nick. I suspect you are the -1 vote that ticked me off yesterday. Your answer is disinformation. Obviously, it is true that viewing a dump of 32-bit words in a little endian machine will produce the visual which resembles what OP asked about. That is not the same thing as OP inquired. We have zero evidence that the professor was referring to this, in fact we have evidence TO THE CONTRARY: " A day later, the intern gets back to me and tells me the professor is now claiming that C is automagically converting the addresses to display the string in proper order."Willock
Everybody here knows already that viewing sequential byte data as words on a little endian machine will show swapped bytes -- that is practically the definition of little endian. The claims that OP relates were made by his professor were not about viewing dumps in the debugger. At the very least, OP had received information that the claim was about the actual order of the bytes in memory. It's rather irritating that arm-chair psychologists are trying to reach into the mind of the professor, criticizing correct answers which do not. I think these people are slaves to authority figures.Willock
R
19

Endianness defines the order of bytes within multi-byte values. Character strings are arrays of single-byte values. So each value (character in the string) is the same on both little-endian and big-endian architectures, and endianness does not affect the order of values in a structure.

Rillings answered 5/5, 2016 at 19:6 Comment(0)
C
16

The professor is confused. In order to see something like 'P-yM azzi' you need to take some memory inspection tool that displays memory in '4-byte integer' mode and at the same time gives you a "character interpretation" of each integer in higher-order byte to lower-order byte mode.

This, of course, has nothing to do with the string itself. And to say that the string itself is represented that way on a little-endian machine is utter nonsense.

Chatoyant answered 14/10, 2009 at 18:45 Comment(4)
OK, @AndreyT, I think I need your help on this one. As usual, you are right, but could it be: that's exactly what the prof meant? I have a feeling the SO crowd has lurched in the wrong direction on this one...Patentor
Hmm... Maybe, but what would be the "correct" answer in this case? If one inspects little-endian memory as a sequence of bytes, one would see 'My-Pizza' in there. If one interprets it as a sequence of 2-byte ints, it would be 'yM P- zi az'. In the case of 4-byte ints it's 'P-yM azzi'. And finally an 8-byte int interpretation would give 'azziP-yM'. All these "interpretations" are just that: interpretations, ways to display data in memory. All of them are "correct", once one understands where they come from. Nothing gives the professor the basis to insist on just one of them as the "right" one.Chatoyant
It makes very little sense for a debugger to say "This integer, if stored on a machine with different endianness, would represent this different string in memory".Might
Agreed with @AndreyT's comment. The professor should have specified the size of each word. In this case, the professor assumed a 4-byte (32-bit) word.Anglaangle
M
12

You can quite easily prove that the compiler is doing no such "magic" transformations, by doing the printing in a function that doesn't know it's been passed a string:

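#include <stdio.h>
#include <string.h>

/* Prints n bytes starting at mem; it has no idea they came from a string. */
void foo(const void *mem, size_t n)
{
    const char *cptr, *end;
    for (cptr = mem, end = cptr + n; cptr < end; cptr++)
        printf("%p : %c\n", (void *)cptr, *cptr);
}

int main(void)
{
    const char* s = "My-Pizza";

    foo(s, strlen(s));
    foo(s + 1, strlen(s) - 1);
    return 0;
}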

Alternatively, you can even compile to assembly with gcc -S and conclusively determine the absence of magic.

Might answered 14/10, 2009 at 20:40 Comment(2)
+1 for ASM. Also, you can write this routine IN assembly just to prove it.Scarlet
+1 for assembly, I went back and linked to this answer from #1566067Fradin
D
10

The professor is wrong if we're talking about a system that uses 8 bits per character.

I often work with embedded systems that actually use 16-bit characters, each word being little-endian. On such a system, the string "My-Pizza" would indeed be stored as "yMP-ziaz".

But as long as it's an 8-bit-per-character system, the string will always be stored as "My-Pizza" independent of the endian-ness of the higher-level architecture.

Dato answered 14/10, 2009 at 18:23 Comment(9)
+1 Heath, I've done a lot of embedded work and never seen something weird like that.Valleau
One product I've worked on uses a Texas Instruments DSP (2808, I think), whose smallest addressable unit of memory is 16 bits.Dato
Aha, all bets are off when it comes to DSP. How would you write the OP's program with only 16-bit addressing? Do you have to decompose the 16-bit chunks into 8-bit pieces yourself?Valleau
Dmitry, that's cool about the DSP. Are you using a C compiler that has the "char *" type? What happens when you have char * p = "MyPi"; and perform "i=*p; p++; j=*p"? Can you individually address the bytes that are packed into the 16-bit words using a char * in C?Willock
A "char" in this compiler is actually 16 bits. So an ASCII string would be stored with each character taking up 16 bits, such as "M\0y\0-\0P\0 ...". So, in reality, what I wrote in my response would not happen in practice, at least for string literals. It does happen for long integers; i.e. 0x12345678 would be stored as 0x3412 0x7856.Dato
That seems more like what I would have expected for 16-bit minimum addressing.Valleau
I've seen this too, for example, when programming on a Canon A620 digital camera using the CHDK hack. Not only is the pixel data 10 bits packed, but the data is accessed in a 16-bit little-endian format. So you have to read 2 chars, swap them, repeat a few times, and then unpack.Degrease
Please stop calling them bytes with more than 8 bits... What you're talking about is the word size of the processor not the byte size... ;)Doone
@Doone Actually, as far as the C standard is concerned, bytes are not necessarily 8 bits wide. The C99 draft standard, section 3.6, defines a byte as an "addressable unit of data storage large enough to hold any member of the basic character set of the execution environment", and its width in bits is given by the constant CHAR_BIT, which must be greater than or equal to 8. So if the smallest addressable memory unit of that DSP is a 16-bit word, on that system a byte is 16 bits wide.Salivate
E
2

But shockingly, the intern claims his professor insists that the string would be represented as:

P-yM azzi

Represented as what, exactly? Represented to the user as a 32-bit integer dump? Or laid out in the computer's memory as P-yM azzi?

If the professor said "My-Pizza" would be laid out as "P-yM azzi" in the computer's memory because the computer has a little-endian architecture, then somebody, please, teach that professor how to use a debugger! I think that's where all the professor's confusion stems from. I have an inkling that the professor is not a coder (not that I'm looking down on him), and that he doesn't have a way to prove in code what he learned about endianness.

Maybe the professor learned about endianness just a week ago, used a debugger incorrectly, was delighted by his new insight into computers, and immediately preached it to his students.

If the professor said the endianness of the machine has a bearing on how ASCII strings are represented in memory, he needs to clean up his act; somebody should correct him.

If the professor had instead given an example of how integers are laid out differently depending on the machine's endianness, his students could appreciate what he is teaching.

Elevator answered 15/10, 2009 at 7:35 Comment(0)
S
1

I assume the professor was trying to make a point by analogy about the endian/NUXI problem, but you're right when you apply it to actual strings. Don't let that distract from the fact that he was trying to teach students a point and how to think about a problem in a certain way.

Sugary answered 14/10, 2009 at 18:27 Comment(1)
Teaching someone a "point" by telling lies isn't teaching anything. That's horrible, don't let him get away with it.Valleau
H
1

You may be interested to know that it is possible to emulate a little-endian architecture on a big-endian machine, or vice versa. The compiler has to emit code which automagically messes with the least significant bits of char* pointers whenever it dereferences them: on a 32-bit machine you'd map 00 <-> 11 and 01 <-> 10.

So, if you write the number 0x01020304 on a big-endian machine, and read back the "first" byte of that with this address-munging, then you get the least significant byte, 0x04. The C implementation is little-endian even though the hardware is big-endian.

You need a similar trick for short accesses. Unaligned accesses (if supported) may not refer to adjacent bytes. You also can't use native stores for types bigger than a word because they'd appear word-swapped when read back one byte at a time.

Obviously, however, little-endian machines do not do this all the time; it's a very specialist requirement, and it prevents you from using the native ABI. It sounds to me as though the professor thinks of actual numbers as being "in fact" big-endian, and is deeply confused about what a little-endian architecture really is and/or how its memory is being represented.

It's true that the string is "represented as" P-yM azzi on 32bit l-e machines, but only if by "represented" you mean "reading the words of the representation in order of increasing address, but printing the bytes of each word big-endian". As others have said, this is what some debugger memory views might do, so it is indeed a representation of the contents of the memory. But if you're going to represent the individual bytes, then it is more usual to list them in order of increasing address, no matter whether words are stored b-e or l-e, rather than represent each word as a multi-char literal. Certainly there is no pointer-fiddling going on, and if the professor's chosen representation has led him to think that there is some, then it has misled him.

Herniorrhaphy answered 14/10, 2009 at 21:15 Comment(4)
What!? Name me one such compiler that emits these automagic codes that munge the bottom two bits of every pointer access everywhere.Hospitalize
I have specialized library functions for doing this on the 1 in 10 million case this is actually correct.Alis
@Adam: not strictly the compiler, but the so-called "translator", which you can consider like a compiler back-end, for Tao Group's now sadly defunct "intent". The intent environment was always little-endian, even on big-endian hardware. This made implementing network drivers a little confusing, since intent code had one endianness, and inline native assembler the opposite. And as I specifically stated, it did not munge every pointer access, it only munged non word-size pointer access. Made it easier for writers of portable apps to test, because they didn't need a b-e platform to hand.Herniorrhaphy
The more important goal, though, was that intent had a virtual assembler language and byte code, which in order to be portable needed to have a consistent endian-ness, consistent sizes of builtin types, etc. It was then up to the translator to make this work on a given platform.Herniorrhaphy
M
0

Also (and I haven't played with this in a long time, so I might be wrong), he might be thinking of Pascal, where strings are represented as "packed arrays" which, IIRC, are characters packed into 4-byte integers?

Manipulate answered 14/10, 2009 at 21:52 Comment(0)
P
0

It's hard to read the prof's mind and certainly the compiler is not doing anything other than storing bytes to adjacent increasing addresses on both BE and LE systems, but it is normal to display memory in word-sized numbers, for whatever the word size is, and we write one thousand as 1,000. Not 000,1.

$ cat > /tmp/pizza
My-Pizza^D
$ od -X /tmp/pizza
0000000 502d794d 617a7a69
0000010
$ 

For the record, y == 79, M == 4d.

Patentor answered 15/10, 2009 at 0:48 Comment(12)
Actually, such a format is pretty standard. A 32-bit dump with ASCII alongside in my ARM debugger shows me the 32-bit words in the right (logical) order, but the ASCII dump is in bytewise order.Valleau
Agreed, but also kinda my point. You needed two dumps with opposite paradigms, printed side-by-side.Patentor
I should add that of course I can't read the prof's mind. But I'm a bit shocked that an interpretation that made the prof's point perfectly valid didn't seem to occur to a lot of people.Patentor
Probably because it's utterly ridiculous to use a ten-mile-long confusing explanation to justify a statement that is still completely wrong. The question was whether the bytes are in memory in that order, and they're not. The fact that they will appear backwards if you go out of your way to print them backwards proves nothing.Birdseed
No, this idea occurred to Carl Norum 5 hours before your post. The OP made a specific statement with: "A day later, the intern gets back to me and tells me the professor is now claiming that C is automagically converting the addresses to display the string in proper order." The OP seems to have faith in the intern who is passing the message for him, but that could surely be the problem. Also, the OP wants to know what is correct, and he seems to want some references. I agree with your psychoanalysis that this likely stemmed from a miscommunication, but does that answer the OP's question?Willock
When I'm saying that the professor is confused, I mean that he's wrong to insist on one and only one representation method as The Only True One, while, as you yourself said above, they both are right. Moreover, there are more ways to interpret the memory contents in this case. Now, as an additional note, when one's talking about strings (sequences of bytes), trying to push a 4-byte int memory view as the only appropriate way to inspect the memory is what I'd call "unorthodox".Chatoyant
Frankly it doesn't matter whether the Prof understands endian-ness or not. The fact that his student has come away, having asked a specific question, with the impression that C is automagically converting addresses, means that (at least one of) the Prof's students doesn't understand endian-ness. Whoever first said the words "converting addresses" is in the wrong here, because that is what is wrong. Arguing over how little-endian memory should be represented is one thing, and sure both people can be right. Thinking that any addresses are being reversed by C is factually incorrect.Herniorrhaphy
Look, assuming the intern I'm speaking with is giving me the facts accurately, the professor is simply wrong. Some here have argued that the professor is correct "from a certain point of view", i.e. the string can be "represented" as "P-yM azzi" if you use a debugger and interpret the memory as a 32-bit integer. Granted, this is true, but this is totally misleading and has no bearing on how the string is ACTUALLY stored in memory. And certainly, it is totally false that the C language does any kind of address "remapping" under the hood to compensate for endianness.Litotes
You're incorrect that this representation has no bearing on how strings are actually stored in memory. It describes the contents of the memory by a 1-1 mapping. If the Prof has said that for his course, this is how memory contents will be represented, then that's how the string in question "is represented". However, he's failed to explain what's actually going on, which is presumably a fault in the lessons. He's also wrong if he thinks that's the only way to describe memory, just as you'd be wrong to say lower-address bytes are ACTUALLY on the left.Herniorrhaphy
Did you really need to post this as an answer AND as a comment?Leet
+1 (from zero) for a simple and simply expressed explanation of what the Prof was no doubt trying to say. Not sure why this was so controversial.Spillar
@Patentor I expect you to respond to my comment, Mr. 60k points from 1800 answers...Willock
B
0

AFAIK, endianness only makes sense when you want to break a large value into small ones. Therefore I don't think that C-style strings are affected by it, because they are, after all, just arrays of characters. When you are reading only one byte, how could it matter if you read it from left or right?

Betterment answered 15/10, 2009 at 5:36 Comment(0)
A
0

I came across this and felt the need to clear it up. No one here seems to have addressed the concept of bytes and words or how to address them. A byte is 8-bits. A word is a collection of bytes.

If the computer is:

  • byte addressable
  • with 4-byte (32-bit) words
  • word aligned
  • the memory is viewed "physically" (not dumped and byte-swapped)

then indeed, the professor would be correct. His failure to indicate this proves he doesn't exactly know what he is talking about, but he did understand the basic concept.

Byte Order Within Words: (a) Big Endian, (b) Little Endian

Character and Integer Data in Words: (a) Big Endian, (b) Little Endian

Anglaangle answered 28/1, 2013 at 20:37 Comment(7)
you wrote, "then indeed, the professor would be correct." And that is absolutely false. OP presented professor (via intern) with some C code that you may want to study until you understand it. In the meanwhile, I see you are able to assist people who use JavaScript and stuff like that.Willock
@Heath - The C code would have the same result executed on Big Endian or Little Endian. The physical diagram above for little endian makes the data look backwards but when it is traversed from increasing byte address, one byte at a time it would print in the same order on either system and result in "My-Pizza". Architecture professor wanted to see it displayed like the 2nd diagram above for Little Endian. This is very common type of question in computer architecture classes. This is the correct answer and I will go with the Intel published document being correct on this one.Wisdom
@Wisdom - There is no question as to the intel document or other well-known representations in word address (such as a "DD" command in a debugger). The question would be: how do these correct representations relate to the incorrect representation given by OP? The answer is psychological: they are attempts to make sense of the nonsense presented in the question. On their own, they are axiomatic in their correctness. In terms of answering OP's question, they are wrong. To answer in these terms; wrong. To pretend I question the convention: straw man. Good day, axawire.Willock
@HeathHunnicutt as a student this was by far the most useful answer. It may be wrong by the conventions you use, but it helps me understand what is happening at a hardware level.Lorrielorrimer
@user2161613 do you understand that the ASCII string is stored in memory one character after the other, without any byte-swapping? Because that's the fact. This answer, for all of its nifty graphics, is basically wrong. If the memory is viewed "physically," the characters will be in order.Willock
@HeathHunnicutt yeh I've been working on this again today, and you are right. This answer is confusing. The best explanation I've seen so far is simply playing with the MARS MIPS simulator. Is the simple version big endian looks right for numbers and little endian looks right for strings? By "looks right" I mean numbers will appear with the MSF first and strings appear as if you are reading left to right.Lorrielorrimer
I think this was the answer I was looking for electronics.stackexchange.com/questions/1760/…Lorrielorrimer
