Are pointer variables just integers with some operators or are they "symbolic"?

EDIT: The original word choice was confusing. The term "symbolic" is much better than the original ("mystical").

In the discussion about my previous C++ question, I have been told that pointers are not simply integers but "symbolic".

This does not sound right! If nothing is symbolic and a pointer is its representation, then I can do the following. Can I?

#include <stdio.h>
#include <string.h>

int main() {
    int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
    if (memcmp (&pa1, &pb, sizeof pa1) == 0) {
        printf ("pa1 == pb\n");
        *pa1 = 2;
    }
    else {
        printf ("pa1 != pb\n");
        pa1 = &a[0]; // ensure well defined behaviour in printf
    }
    printf ("b = %d *pa1 = %d\n", b, *pa1);
    return 0;
}

This is a C and C++ question.

Testing with Compile and Execute C Online with GNU GCC v4.8.3: gcc -O2 -Wall gives

pa1 == pb
b = 1 *pa1 = 2

Testing with Compile and Execute C++ Online with GNU GCC v4.8.3: g++ -O2 -Wall

pa1 == pb
b = 1 *pa1 = 2

So the modification of b via *pa1 (one past the end of a) fails with GCC in both C and C++.

Of course, I would like an answer based on standard quotes.

EDIT: To respond to criticism about UB on &a + 1, now a is an array of 1 element.

Related: Dereferencing an out of bound pointer that contains the address of an object (array of array)

Additional note: the term "mystical" was first used, I think, by Tony Delroy here. I was wrong to borrow it.

Indict answered 17/8, 2015 at 8:31 Comment(23)
Your sample code has UB.Animal
The compiler is free to arrange variables, so your code may work as you expect or it may not. It's undefined behaviour.Allanadale
[expr.add]/5 "[for pointer addition, ] if both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."Selflove
@Selflove In case it makes a difference, I have changed a to an array.Indict
Dereferencing &a + 1 is undefined, and the compiler is free to assume that doing it does not modify b and instead inline b's value.Argufy
@Indict : Why? Because the standard doesn't require the compiler to arrange variables in a specific way.Allanadale
@Indict it doesn't make a difference, b is not an element of the array, so the behaviour is undefined.Selflove
@Argufy So two pointers with equal values can have different semantic values?Indict
@Indict Yes, an invalid pointer has different semantics than a valid one. In particular, dereferencing an invalid pointer makes your entire program undefined.Argufy
@Argufy What is an "invalid pointer"?Indict
@curiousguy: a pointer is invalid when it does not point at an object, a member of an array, or one past the end of an array.Parget
@ZanLynx With the change of a to an array of 1 int, the pointer is valid.Indict
@curiousguy: You're allowed to have a pointer one past the end. But you aren't allowed to dereference it. There's nothing there. Also, the compiler is allowed to look at your pointer use and reduce everything it sees. So you declare b and you declare pointers. But the compiler is free to delete all of that and in fact reduce your entire program to one print statement if it feels like it.Parget
@Indict the value of a pointer to the hypothetical element after an array is well-defined, but dereferencing it is undefined behaviour.Selflove
@ZanLynx So a pointer is more than its bit pattern.Indict
@curiousguy: On x86 and x64 it is a bit pattern. The compiler assumes that all code follows the rules and it may not notice that you changed the bit pattern. Or it might move things into registers and remove the pointers entirely, causing your "clever thing" to disappear. If you don't follow the rules, the compiler optimizations will destroy you.Parget
@Indict Yes, it "is" more than a bit pattern, even though the bit pattern is the entire representation. And so are ints, floats, and everything else. Using the value of an uninitialised int object is also undefined, regardless of the bit pattern it stores.Argufy
@ZanLynx "it may not notice that you changed the bit pattern" I did notIndict
@Allanadale "Because the standard doesn't require the compiler to arrange variables in a specific way." Of course the compiler could randomize the addresses of complete objects. But then, during every program run, the addresses once set are well defined and can be used for mathematical computations, as an address is just a number. When the compiler has "arranged" the objects in memory, it is committed to this "arrangement" at least during this program execution, and I can play.Indict
@Argufy Would you agree that two pointers with the same value are either both valid or both invalid?Indict
@ZanLynx "Also, the compiler is allowed to look at your pointer use and reduce everything it sees" This is a language-lawyer question. Please provide a quote.Indict
@Indict It is the as-if rule, see en.cppreference.com/w/cpp/language/as_if and https://mcmap.net/q/20201/-what-exactly-is-the-quot-as-if-quot-rule the answer there has a reference to parts of the C++11 standard.Parget
"The "as-if" rule basically defines what transformations an implementation is allowed to perform on a legal C++ program" Yes and nobody has been able to point to a rule explicitly allowing that transformation.Indict

C was conceived as a language in which pointers and integers were very intimately related, with the exact relationship depending upon the target platform. The relationship between pointers and integers made the language very suitable for purposes of low-level or systems programming. For purposes of discussion below, I'll thus call this language "Low-Level C" [LLC].

The C Standards Committee wrote up a description of a different language, where such a relationship is not expressly forbidden, but is not acknowledged in any useful fashion, even when an implementation generates code for a target and application field where such a relationship would be useful. I'll call this language "High Level Only C" [HLOC].

In the days when the Standard was written, most things that called themselves C implementations processed a dialect of LLC. Most useful compilers process a dialect which defines useful semantics in more cases than HLOC, but not as many as LLC. Whether pointers behave more like integers or more like abstract mystical entities depends upon which exact dialect one is using. If one is doing systems programming, it is reasonable to view C as treating pointers and integers as intimately related, because LLC dialects suitable for that purpose do so, and HLOC dialects that don't do so aren't suitable for that purpose. When doing high-end number crunching, however, one would far more often be using dialects of HLOC which do not recognize such a relationship.

The real problem, and source of so much contention, lies in the fact that LLC and HLOC are increasingly divergent, and yet are both referred to by the name C.

Whisker answered 14/6, 2018 at 20:11 Comment(0)

The first thing to say is that a sample of one test on one compiler generating code on one architecture is not the basis on which to draw a conclusion on the behaviour of the language.

C++ (and C) are general-purpose languages created with the intention of being portable, i.e. a well-formed program written in C++ on one system should run on any other (barring calls to system-specific services).

Once upon a time, for various reasons including backward-compatibility and cost, memory maps were not contiguous on all processors.

For example I used to write code on a 6809 system where half the memory was paged in via a PIA addressed in the non-paged part of the memory map. My c compiler was able to cope with this because pointers were, for that compiler, a 'mystical' type which knew how to write to the PIA.

The 80x86 family has a real-mode addressing scheme where memory is addressed via segment:offset pairs, with segments starting on 16-byte boundaries. Look up FAR pointers and you'll see different pointer arithmetic.

This is the history of pointer development in c++. Not all chip manufacturers have been "well behaved" and the language accommodates them all (usually) without needing to rewrite source code.

Demonism answered 17/8, 2015 at 8:47 Comment(10)
The compiler output is simply an illustration of the fact that GCC doesn't support this crazy idea. It isn't used as "proof" of anything, and it doesn't work with the modified code (the one with the array).Indict
C was designed so that the language could be ported to many machines, and so that a programmer who was familiar with C and with the general characteristics of a particular architecture would know how to write C code for that architecture. The design of the language is hostile to the writing of architecture-agnostic code. On the other hand, the reason C became popular is that it didn't try to be "one language", but instead a family of dialects that could exploit the various strengths of different architectures.Whisker
@Whisker when you write "The design of the language is hostile to the writing of architecture-agnostic code." I have to say that this conflicts with my life experience. As written above, I have written C on systems based on Z80, 6502, 6809, 68000, 80x86 and TMS9900, both with and without paged memory and with all kinds of I/O mappings. The C language (and a couple of portability macros) allowed the same source code to compile into functional programs (and mini-OS) for all these systems. The only points of customisation were a few macro definitions, device drivers and linker maps.Demonism
@RichardHodges: There's a difference between writing code that will work on a particular set of architectures, and writing code that is truly architecture agnostic. The preprocessor can help a lot with portability issues, but a language designed to facilitate architecture-agnostic code would specify that math will behave in two's-complement fashion, even if that means using unsigned math on the underlying architecture and then adding code to handle the situations where it behaves differently from signed. It would also specify architecture-independent promotion rules for..Whisker
..."fixed-sized" types. Writing a Java implementation for something like a 36-bit machine would be "interesting", but if the platform supports compare-and-swap, or if the implementation runs on a single core and gets to control scheduling of its threads, I think it would be possible to achieve halfway-decent performance. By contrast, most C programs written for common microprocessors would be completely useless on a 36-bit machine.Whisker
@Whisker I would agree that not all C programs are well written. It is worth noting that C compilers existed for DEC and IBM architectures which had 9-bit chars and 36-bit words. The integral type sizes in C were deliberately vague for precisely this reason. Writers of portable programs don't as a rule seek to depend on integer overflow behaviour.Demonism
@RichardHodges: I've written a TCP stack on a platform with 16-bit "char", and using a language that was pretty much like normal C except for the 16-bit char was definitely nicer than writing everything in TMS3205x assembly code would have been, but the language did nothing to help with making my code be architecture-agnostic. A language designed to let people write architecture-agnostic code should include data types with architecture-agnostic semantics, even if they need to be emulated or even make certain programs incompatible with some platforms. For performance, it may also...Whisker
...have "native" data types, but my job would have been a lot easier if there were a means of declaring a "16 bits stored as two octets little-endian" data type and have a compiler generate code that would split a write of such a value into two "char"-sized writes [using the bottom 8 bits of each "char"]. If such a type existed, a TCP stack for the PC that used such types would have been easily portable to the TMS part. Parts of it may have performed unacceptably slowly using such emulated types, and thus had to be hand-tweaked to use native types, but that would have been nicer...Whisker
@Whisker can't disagree with that. I had to define pseudo-types for such concepts as "index into array", as signed/unsigned types of 16/8 bits implied vastly different performance and space characteristics between Z80, 6809 etc. Still, the end result was 100% portable with only 2 hours of configuration.Demonism
@RichardHodges: It may have been 100% portable among quality general-purpose implementations suitable for low-level programming on a certain subset of platforms, but it would not be "portable" in the sense that the Standard uses the term, nor would it necessarily be reliably portable among "modern" compilers for those platforms.Whisker

Stealing the quote from TartanLlama:

[expr.add]/5 "[for pointer addition, ] if both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."

So the compiler can assume that your pointer points into the a array, or one past its end. If it points one past the end, you cannot dereference it. But since you do dereference it, it surely can't be one past the end, so it can only be inside the array.

So now you have your code (reduced)

b = 1;
*pa1 = 2;

where pa1 points inside an array a and b is a separate variable. And when you print them, you get exactly 1 and 2, the values you have assigned them.

An optimizing compiler can figure that out, without even storing a 1 or a 2 to memory. It can just print the final result.

Astaire answered 17/8, 2015 at 9:8 Comment(13)
"If it points one past the end, you cannot defererence it" This one isn't clear; what does "point" mean?Indict
You know fine well what it means. It holds the address of the hypothetical a[N], i.e. if the array were 1 element larger, it would point at the final element. The real questions: Why on Earth have you made so many questions about this concept? Would it be useful for anything if it weren't UB?Zoophobia
@Zoophobia If a pointer is a trivial type, then two pointers with the same representation must point to the same set of things. So a one-past-the-end pointer with the same representation as the pointer to the object after the array must point to that object. Are you pretending that pointers aren't really trivial types?Indict
I'm not pretending anything, but you appear to have pretended a 2nd pointer exists when none did in this discussion.. The 1st pointer, i.e. the one that actually exists in this discussion, points one past the end and is valid to form, but not to dereference.Zoophobia
@Zoophobia Are you saying that a pointer cannot be both one past the end and pointing to an object? I'm struggling with that.Indict
@Indict - Yes, that's probably what he is saying. If you have a pointer and move it past the end, it no longer points to an object. Now you could also have some other pointer that does point to an object, and that object just could have the same address as the past-the-end. But they are still different pointers and not interchangeable.Astaire
@BoPersson I have no problem with the idea that two objects can be equal by any allowed programmatic measurement and still be different. (It just means that the means of measurements are limited.) It's more difficult to accept that two value storing objects can be different if they store equal values. We know that pointers are value storing objects in all currently used compilers. There is no hidden flag in a pointer representation that wouldn't be measurable by ==. (This can be confirmed by memcmp.) That's my difficulty.Indict
Not only that, we also know that a pointer value can be converted to an integer and back to a pointer, so the integer must fully represent the complete value of the pointer. So two pointers with identical representation will be converted to equal integer values. Are you saying that integers can hold the same value but still be different?Indict
@Indict - Those are the rules. :-) The rules were set at a time when segmented memory was still common. And segments could overlap, so memcmp wasn't reliable - different bit patterns segment:offset could mean the same address. And vice versa - with arrays allocated in separate segments, the same pointer bit pattern meant different objects depending on which segment was used as a base.Astaire
@BoPersson I understand that a given value for a type can have many different representation. That could also be the case with a fraction class where different fraction representations are indistinguishable by any allowed measurement while still not comparing equal via memcmp, which is not an "allowed measurement" for such type. But two fractions with identical representation must be equal. This is implied by the fact that the intrinsic value of a fraction object is determined ONLY by the state of its members.Indict
"the same pointer bit pattern meant different objects" How would the compiler manage to make an access to the right object, given an ambiguous pointer value?Indict
@Indict - It's part of the segment:offset addressing. The segment part had to be loaded into a segment register, and then you could use just the offset as a pointer into an array stored in that segment. To move to a different array the compiler would have to reload the segment register and then use another set of pointers.Astaire
@curiousguy: Except when using huge pointers (which are seldom used, because they are extremely slow and inefficient), all accesses made to a particular object will use the same segment, and a compiler will assume that two pointers with different segments cannot identify the same object or portions thereof. Consequently, individual objects are generally limited to 65520 (i.e. 65536-16) bytes. The answer to when a compiler should change the segment part of a non-huge pointer to an object is simply: never.Whisker

If you turn off the optimiser the code works as expected.

By using pointer arithmetic that is undefined you are fooling the optimiser. The optimiser has figured out that there is no code writing to b, so it can safely store it in a register. As it turns out, you have acquired the address of b in a non-standard way and modify the value in a way the optimiser doesn't see.

If you read the C standard, it says that pointers may be mystical. gcc pointers are not mystical. They are stored in ordinary memory and consist of the same type of bytes that make up all other data types. The behaviour you encountered is due to your code not respecting the limitations stated for the optimiser level you have chosen.

Edit:

The revised code is still UB. The standard doesn't allow dereferencing a[1] even if the pointer value happens to be identical to another pointer value. So the optimiser is allowed to store the value of b in a register.

Beaker answered 17/8, 2015 at 9:3 Comment(4)
Comments are not for extended discussion; this conversation has been moved to chat.Creighton
The optimizers in gcc and clang treat pointers as mystical. They also treat values of type uintptr_t as mystical. If int *p can be used to access an object and int *q has the same bit pattern but cannot be used to identify the object, gcc's optimizer will in some cases go so far as to assume that, where uintptr_t uptr is known to be equal to (uintptr_t)q, an access to (int*)uptr won't affect *p, even if the value in uptr happens to actually be derived from (uintptr_t)p.Whisker
@Whisker "even if the value (...)" when would that happen?Indict
@curiousguy:Given #include <stdint.h> extern int x,y[]; int test(uintptr_t z) { x = 1; if (z == (uintptr_t)(1+y)) { *(int*)z=2; } return x; } gcc will ignore the possibility that *(int*)z might identify x, even though the behavior of test((uintptr_t)&x) should be defined as always either returning 1 with no side-effect, or writing 2 to x and then returning 2.Whisker
