What A Provocative Question!
Even a cursory scan of the responses and comments in this thread will reveal how emotive your seemingly simple and straightforward query turns out to be.
It should not be surprising.
Inarguably, misunderstandings around the concept and use of pointers represent a predominant cause of serious failures in programming in general.
Recognition of this reality is readily evident in the ubiquity of languages designed specifically to address, and preferably to avoid altogether, the challenges pointers introduce. Think C++ and other derivatives of C, Java and its relations, Python and other scripts -- merely as the more prominent and prevalent ones, and more or less ordered in severity of dealing with the issue.
Developing a deeper understanding of the underlying principles must therefore be pertinent to every individual who aspires to excellence in programming -- especially at the systems level.
I imagine this is precisely what your teacher means to demonstrate.
And the nature of C makes it a convenient vehicle for this exploration. It exposes the machine less clearly than assembly -- though perhaps more comprehensibly -- and still far more explicitly than languages based on deeper abstraction of the execution environment.
Designed to facilitate deterministic translation of the programmer’s intent into instructions that machines can comprehend, C is a system level language. While classified as high-level, it really belongs in a ‘medium’ category; but since none such exists, the ‘system’ designation has to suffice.
This characteristic is largely responsible for making it a language of choice for device drivers, operating system code, and embedded implementations. It is, furthermore, a deservedly favoured alternative in applications where optimal efficiency is paramount; where that means the difference between survival and extinction, and therefore is a necessity as opposed to a luxury. In such instances, the attractive convenience of portability loses all its allure, and opting for the lacklustre performance of the least common denominator becomes an unthinkably detrimental option.
What makes C -- and some of its derivatives -- quite special, is that it allows its users complete control -- when that is what they desire -- without imposing the related responsibilities upon them when they do not. Nevertheless, it never offers more than the thinnest of insulations from the machine, wherefore proper use demands exacting comprehension of the concept of pointers.
In essence, the answer to your question is sublimely simple and satisfyingly sweet -- in confirmation of your suspicions. Provided, however, that one attaches the requisite significance to every concept in this statement:
- The acts of examining, comparing and manipulating pointers are always and necessarily valid, while the conclusions derived from the result depend on the validity of the values contained, and thus need not be.
The former is both invariably safe and potentially proper, while the latter can only ever be proper when it has been established as safe. Surprisingly -- to some -- so establishing the validity of the latter depends on and demands the former.
Of course, part of the confusion arises from the effect of the recursion inherently present within the principle of a pointer -- and the challenges posed in differentiating content from address.
You have quite correctly surmised:
“I'm being led to think that any pointer can be compared with any other pointer, regardless of where they individually point. Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.”
And several contributors have affirmed: pointers are just numbers. Sometimes something closer to complex numbers, but still no more than numbers.
The amusing acrimony in which this contention has been received here reveals more about human nature than programming, but remains worthy of note and elaboration. Perhaps we will do so later...
As one comment begins to hint, all this confusion and consternation derives from the need to discern what is valid from what is safe, but that is an oversimplification. We must also distinguish what is functional and what is reliable, what is practical and what may be proper, and further still: what is proper in a particular circumstance from what may be proper in a more general sense. Not to mention the difference between conformity and propriety.
Toward that end, we first need to appreciate precisely what a pointer is.
- You have demonstrated a firm grip on the concept, and like some others may find these illustrations patronizingly simplistic, but the level of confusion evident here demands such simplicity in clarification.
As several have pointed out: the term pointer is merely a special name for what is simply an index, and thus nothing more than any other number.
This should already be self-evident in consideration of the fact that all contemporary mainstream computers are binary machines that necessarily work exclusively with and on numbers. Quantum computing may change that, but that is highly unlikely, and it has not come of age.
Technically, as you have noted, pointers are more accurately addresses; an obvious insight that naturally introduces the rewarding analogy of correlating them with the ‘addresses’ of houses, or plots on a street.
In a flat memory model: the entire system memory is organized in a single, linear sequence: all houses in the city lie on the same road, and every house is uniquely identified by its number alone. Delightfully simple.
In segmented schemes: a hierarchical organization of numbered roads is introduced above that of numbered houses so that composite addresses are required.
- Some implementations are still more convoluted, and the totality of distinct ‘roads’ need not sum to a contiguous sequence, but none of that changes anything about the underlying principle.
- We are necessarily able to decompose every such hierarchical link back into a flat organization. The more complex the organization, the more hoops we will have to hop through in order to do so, but it must be possible. Indeed, this also applies to ‘real mode’ on x86.
- Otherwise the mapping of links to locations would not be bijective, as reliable execution -- at the system level -- demands that it MUST be.
- multiple addresses must not map to singular memory locations, and
- singular addresses must never map to multiple memory locations.
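To make those ‘hoops’ concrete, here is a minimal sketch, assuming x86 real mode, where the linear address is formed as segment * 16 + offset. The names `far_address`, `flatten` and `to_far` are invented for this illustration and belong to no real API.

```c
#include <assert.h>
#include <stdint.h>

/* A composite 'road and house number' address, as used in x86 real mode. */
typedef struct {
    uint16_t segment;  /* the road */
    uint16_t offset;   /* the house number on that road */
} far_address;

/* Real mode forms the flat (linear) address as segment * 16 + offset. */
uint32_t flatten(far_address a) {
    return ((uint32_t)a.segment << 4) + a.offset;
}

/* The way back: choose the composite form whose offset is below 16. */
far_address to_far(uint32_t linear) {
    assert(linear <= 0xFFFFFu);  /* the segment must fit in 16 bits */
    far_address a;
    a.segment = (uint16_t)(linear >> 4);
    a.offset  = (uint16_t)(linear & 0xFu);
    return a;
}
```

Flattening 0x1234:0x0010 this way yields 0x12350, and `to_far(0x12350)` hands back a composite address that flattens to the same location.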
Bringing us to the further twist that turns the conundrum into such a fascinatingly complicated tangle. Above, it was expedient to suggest that pointers are addresses, for the sake of simplicity and clarity. Of course, this is not correct. A pointer is not an address; a pointer is a reference to an address, it contains an address. Like the envelope sports a reference to the house. Contemplating this may lead you to glimpse what was meant with the suggestion of recursion contained in the concept. Still; we have only so many words, and talking about the addresses of references to addresses and such, soon stalls most brains at an invalid op-code exception. And for the most part, intent is readily garnered from context, so let us return to the street.
Postal workers in this imaginary city of ours are much like the ones we find in the ‘real’ world. No one is likely to suffer a stroke when you talk or enquire about an invalid address, but every last one will balk when you ask them to act on that information.
Suppose there are only 20 houses on our singular street. Further pretend that some misguided, or dyslexic soul has directed a letter, a very important one, to number 71. Now, we can ask our carrier Frank whether there is such an address, and he will simply and calmly report: no. We can even expect him to estimate how far outside the street this location would lie if it did exist: roughly two and a half street-lengths beyond the end. None of this will cause him any exasperation. However, if we were to ask him to deliver this letter, or to pick up an item from that place, he is likely to be quite frank about his displeasure, and refusal to comply.
Pointers are just addresses, and addresses are just numbers.
Verify the output of the following:
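#include <stdio.h>
void foo( void *p ) {
    /* the last column compares p with its own value after a round trip through size_t */
    printf("%p\t%zu\t%d\n", p, (size_t)p, p == (void *)(size_t)p);
}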
Call it on as many pointers as you like, valid or not. Please do post your findings if it fails on your platform, or your (contemporary) compiler complains.
Now, because pointers are simply numbers, it is inevitably valid to compare them. In one sense this is precisely what your teacher is demonstrating. All of the following statements are perfectly valid -- and proper! -- C, and when compiled will run without encountering problems, even though neither pointer need be initialized and the values they contain therefore may be undefined:
- We are only calculating `result` explicitly for the sake of clarity, and printing it to force the compiler to compute what would otherwise be redundant, dead code.
void foo( size_t *a, size_t *b ) {
    size_t result;
    result = (size_t)a;
    printf("%zu\n", result);
    result = a == b;
    printf("%zu\n", result);
    result = a < b;
    printf("%zu\n", result);
    result = a - b;
    printf("%zu\n", result);
}
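Of course, the program is ill-formed when either a or b is undefined (read: not properly initialized) at the point of testing, but that is utterly irrelevant to this part of our discussion. These snippets, as too the following statements, will compile and, on mainstream flat-memory platforms, typically run without faulting, notwithstanding the IN-validity of any pointer involved. Strictly speaking, the ‘standard’ defines the results of `<` and `-` only for pointers into the same array object, and leaves the remainder undefined; what that designation means in practice is taken up further on.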
Problems only arise when an invalid pointer is dereferenced. When we ask Frank to pick up or deliver at the invalid, non-existent address.
Given any arbitrary pointer:
int *p;
While this statement must compile and run:
printf("%p", (void *)p);
... as must this:
size_t foo( int *p ) { return (size_t)p; }
... the following two, in stark contrast, will still readily compile, but fail in execution unless the pointer is valid -- by which we here merely mean that it references an address to which the present application has been granted access:
printf("%d", *p);
size_t foo( int *p ) { return *p; }
How subtle the change? The distinction lies in the difference between the value of the pointer -- which is the address, and the value of the contents: of the house at that number. No problem arises until the pointer is dereferenced; until an attempt is made to access the address it links to. In trying to deliver or pick up the package beyond the stretch of the road...
By extension, the same principle necessarily applies to more complex examples, including the aforementioned need to establish the requisite validity:
int* validate( int *p, int *head, int *tail ) {
return p >= head && p <= tail ? p : NULL;
}
Relational comparison and arithmetic offer identical utility to testing equivalence, and are equivalently valid -- in principle. However, what the results of such computation would signify, is a different matter entirely -- and precisely the issue addressed by the quotations you included.
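A short sketch of how the `validate` check above might guard a dereference. The helper `read_or_default` is a hypothetical name introduced here, and the sketch assumes that `p`, `head` and `tail` all point into the same array, which is the case where the relational comparison is meaningful beyond doubt.

```c
#include <assert.h>
#include <stddef.h>

/* Same check as validate above: p is accepted only inside [head, tail]. */
int *validate(int *p, int *head, int *tail) {
    return p >= head && p <= tail ? p : NULL;
}

/* Hypothetical helper: looks inside the house only once the address checks out. */
int read_or_default(int *p, int *head, int *tail, int fallback) {
    assert(head <= tail);  /* the range itself must be sane */
    int *checked = validate(p, head, tail);
    return checked != NULL ? *checked : fallback;
}
```

Given a street of 20 houses with deliveries permitted only to numbers 5 through 9, a request for number 7 returns its contents, while a request for number 15 returns the fallback without the door ever being opened.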
In C, an array is a contiguous buffer, an uninterrupted linear series of memory locations. Comparison and arithmetic applied to pointers that reference locations within such a singular series are naturally, and obviously meaningful in relation both to each other, and to this ‘array’ (which is simply identified by the base). Precisely the same applies to every block allocated through `malloc`, or `sbrk`. Because these relationships are implicit, the compiler is able to establish valid relationships between them, and therefore can be confident that calculations will provide the answers anticipated.
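As an illustration of arithmetic that is meaningful because both pointers lie within one series, consider this minimal sketch. The function names `elements_between` and `midpoint` are made up for the example; the same reasoning would apply to a block obtained from `malloc`.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements separating two pointers into the same array or block. */
ptrdiff_t elements_between(const int *first, const int *last) {
    return last - first;
}

/* Midpoint found purely by pointer arithmetic, as a binary search would do. */
const int *midpoint(const int *first, const int *last) {
    assert(first <= last);  /* both must reference the same series */
    return first + (last - first) / 2;
}
```

For an array `block` of eight elements, `elements_between(block + 2, block + 6)` is 4 and `midpoint(block, block + 8)` is `block + 4`.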
Performing similar gymnastics on pointers that reference distinct blocks or arrays does not offer any such inherent and apparent utility. The more so since whatever relation exists at one moment may be invalidated by a reallocation that follows, wherein that is highly likely to change, even be inverted. In such instances the compiler is unable to obtain the necessary information to establish the confidence it had in the previous situation.
You, however, as the programmer, may have such knowledge! And in some instances are obliged to exploit that.
There ARE, therefore, circumstances in which EVEN THIS is entirely VALID and perfectly PROPER.
In fact, that is exactly what `malloc` itself has to do internally when time comes to try merging reclaimed blocks -- on the vast majority of architectures. The same is true for the operating system allocator, like that behind `sbrk`; if more obviously, frequently, on more disparate entities, more critically -- and relevant also on platforms where this `malloc` may not be. And how many of those are not written in C?
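The kind of merging described above can be sketched roughly as follows. This is an illustration only: the `block` header, `coalesce` and `precedes` are hypothetical and do not reflect the internals of any real `malloc`. The ordering test goes through `uintptr_t`, which assumes a flat memory model in which the converted integers order the same way the addresses do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-block header; not the layout used by any real malloc. */
typedef struct block {
    size_t size;         /* bytes covered by this block, header included */
    struct block *next;  /* next free block, in ascending address order */
} block;

/* Address ordering of two unrelated blocks, compared as plain numbers.
   Assumes a flat memory model; the conversion is implementation-defined. */
int precedes(const block *a, const block *b) {
    return (uintptr_t)a < (uintptr_t)b;
}

/* Merge b into a when b starts exactly where a ends; returns 1 on merge. */
int coalesce(block *a, block *b) {
    assert(a != NULL && b != NULL);
    if ((unsigned char *)a + a->size != (unsigned char *)b)
        return 0;        /* not neighbours: leave both untouched */
    a->size += b->size;
    a->next = b->next;
    return 1;
}
```

The allocator knows what the compiler cannot: that both headers live in memory it manages itself, so the numeric relationship between them carries real meaning.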
The validity, security and success of an action is inevitably the consequence of the level of insight upon which it is premised and applied.
In the quotes you have offered, Kernighan and Ritchie are addressing a closely related, but nonetheless separate issue. They are defining the limitations of the language, and explaining how you may exploit the capabilities of the compiler to protect you by at least detecting potentially erroneous constructs. They are describing the lengths the mechanism is able -- is designed -- to go to in order to assist you in your programming task. The compiler is your servant, you are the master. A wise master, however, is one that is intimately familiar with the capabilities of his various servants.
Within this context, undefined behaviour serves to indicate potential danger and the possibility of harm; not to imply imminent, irreversible doom, or the end of the world as we know it. It simply means that we -- ‘meaning the compiler’ -- are not able to make any conjecture about what this thing may be, or represent and for this reason we choose to wash our hands of the matter. We will not be held accountable for any misadventure that may result from the use, or mis-use of this facility.
In effect, it simply says: ‘Beyond this point, cowboy: you are on your own...’
Your professor is seeking to demonstrate the finer nuances to you.
Notice what great care they have taken in crafting their example; and how brittle it still is. By taking the address of `a`, in
p[0].p0 = &a;
the compiler is coerced into allocating actual storage for the variable, rather than placing it in a register. It being an automatic variable, however, the programmer has no control over where that is assigned, and so is unable to make any valid conjecture about what would follow it. Which is why `a` must be set equal to zero for the code to work as expected.
Merely changing this line:
char a = 0;
to this:
char a = 1; // or ANY other value than 0
causes the behaviour of the program to become undefined. At minimum, the first answer will now be 1; but the problem is far more sinister.
Now the code is inviting disaster.
While still perfectly valid and even conforming to the standard, it now is ill-formed and although sure to compile, may fail in execution on various grounds. For now there are multiple problems -- none of which the compiler is able to recognize.
`strcpy` will start at the address of `a`, and proceed beyond this to consume -- and transfer -- byte after byte, until it encounters a null.
The `p1` pointer has been initialized to a block of exactly 10 bytes.
If `a` happens to be placed at the end of a block and the process has no access to what follows, the very next read -- of p0[1] -- will elicit a segfault. This scenario is unlikely on the x86 architecture, but possible.
If the area beyond the address of `a` is accessible, no read error will occur, but the program still is not saved from misfortune.
If a zero byte happens to occur within the ten starting at the address of `a`, it may still survive, for then `strcpy` will stop and at least we will not suffer a write violation.
If it is not faulted for reading amiss, but no zero byte occurs in this span of 10, `strcpy` will continue and attempt to write beyond the block allocated by `malloc`.
If this area is not owned by the process, the segfault should immediately be triggered.
The still more disastrous -- and subtle -- situation arises when the following block is owned by the process, for then the error cannot be detected, no signal can be raised, and so it may ‘appear’ still to ‘work’, while it actually will be overwriting other data, your allocator’s management structures, or even code (in certain operating environments).
This is why pointer related bugs can be so hard to track. Imagine these lines buried deep within thousands of lines of intricately related code, that someone else has written, and you are directed to delve through.
Nevertheless, the program must still compile, for it remains perfectly valid and standard-conformant C.
These kinds of errors, no standard and no compiler can protect the unwary against. I imagine that is exactly what they are intending to teach you.
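Where the programmer does know the extent of both buffers, that knowledge can be put to work. The following is one possible sketch, not a fix taken from the original exercise: `bounded_copy` is a hypothetical helper that never reads past the known size of the source, never writes past the destination, and always terminates the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copies the string held in src, whose storage is known
 * to span src_size bytes, into dst, which spans dst_size bytes.
 * Returns 0 when a complete string was copied, -1 when it had to be cut
 * short or no terminator was found within src_size.
 */
int bounded_copy(char *dst, size_t dst_size, const char *src, size_t src_size) {
    assert(dst != NULL && src != NULL && dst_size > 0);
    const char *end = memchr(src, '\0', src_size);  /* look no further than src_size */
    size_t len = end != NULL ? (size_t)(end - src) : src_size;
    int complete = end != NULL;
    if (len > dst_size - 1) {   /* leave room for the terminator */
        len = dst_size - 1;
        complete = 0;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return complete ? 0 : -1;
}
```

Called with the address of a lone `char a = 1;` and a source size of 1, it copies that single byte, terminates the destination, and reports -1 rather than wandering off past `a` in search of a null.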
Paranoid people constantly seek to change the nature of C to dispose of these problematic possibilities and so save us from ourselves; but that is disingenuous. This is the responsibility we are obliged to accept when we choose to pursue the power and obtain the liberty that more direct and comprehensive control of the machine offers us. Promoters and pursuers of perfection in performance will never accept anything less.
Portability and the generality it represents is a fundamentally separate consideration and all that the standard seeks to address:
“This document specifies the form and establishes the interpretation of programs expressed in the programming language C. Its purpose is to promote portability, reliability, maintainability, and efficient execution of C language programs on a variety of computing systems.”
Which is why it is perfectly proper to keep it distinct from the definition and technical specification of the language itself. Contrary to what many seem to believe, generality is antithetical to the exceptional and the exemplary.
To conclude:
- Examining and manipulating pointers themselves is invariably valid and often fruitful. Interpretation of the results may or may not be meaningful, but calamity is never invited until the pointer is dereferenced; until an attempt is made to access the address linked to.
Were this not true, programming as we know it -- and love it -- would not have been possible.