Is it UB to compare two unrelated pointers using memcmp?
D

3

5

It is undefined behavior in C to perform arithmetic comparison on two unrelated pointers, that is, two pointers that don't point to the same array or object.

int a, b;
bool ub = &a < &b;

One could, however, cast them to uintptr_t:

int a, b;
bool not_ub = (uintptr_t)&a < (uintptr_t)&b;

The cast is defined and the comparison is, too.

However, is it UB to compare the two pointers using memcmp?

int a, b;
int* pa = &a;
int* pb = &b;
int maybe_ub = memcmp(&pa, &pb, sizeof(int*));

C11 §7.24.4.1 says:

The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.

My understanding of that excerpt is that the representation of the pointers is being compared, not the pointers themselves. As such, I'd expect the call to memcmp not to exhibit any undefined behavior.

Since the standard does not specify how objects are stored or represented (except in specific cases, but not here), I am not interested in the result of memcmp, but simply in whether it is or is not UB.

Dues answered 1/2, 2024 at 14:33 Comment(2)
It is not UB, but it is also not exactly a pointer comparison, it is a comparison of the binary representation of two pointers.Handbreadth
@Handbreadth this is my understanding. This post is an attempt at settling a disagreement over the meaning of the C standard, as I was unable to be convinced that I was right or wrong by reading the standard itself.Dues
A
10

The behavior of memcmp(&pa, &pb, sizeof(int*)) is not undefined. We see that because it is defined by the text you quote, C 2018 7.24.4.1 2:

The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.

The objects pa and pb are composed of bytes, per C 2018 6.2.6.1 2:

Except for bit-fields, objects are composed of contiguous sequences of one or more bytes, the number, order, and encoding of which are either explicitly specified or implementation-defined.

And they have sizeof (int *) bytes, per C 2018 6.5.3.4 2:

The sizeof operator yields the size (in bytes) of its operand…

If the memcmp indicates the bytes are the same, pa and pb necessarily have the same value, as the bytes represent the value. If memcmp indicates the bytes are different, the C standard allows for either pa and pb to have different values or to have the same value (because one value may have multiple representations with different bytes).
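The three outcomes described above can be sketched as a small classifier. This is only an illustration: the names `pointer_relation` and `classify` are hypothetical, and the middle case (equal values, different bytes) is permitted by the standard but rarely seen on mainstream implementations.

```c
#include <string.h>

enum pointer_relation {
    SAME_BYTES,            /* identical representation, hence equal values */
    EQUAL_DIFFERENT_BYTES, /* equal values stored with different bytes */
    DIFFERENT_VALUES       /* the pointers compare unequal */
};

/* Hypothetical helper: relates the byte-wise comparison of two pointer
   objects to the value comparison performed by ==. */
enum pointer_relation classify(int *pa, int *pb)
{
    /* Defined behavior: memcmp only reads the bytes of pa and pb. */
    if (memcmp(&pa, &pb, sizeof pa) == 0)
        return SAME_BYTES;

    /* == is defined for any two valid pointers of compatible type. */
    return pa == pb ? EQUAL_DIFFERENT_BYTES : DIFFERENT_VALUES;
}
```

For two distinct objects `a` and `b`, `classify(&a, &b)` has to be `DIFFERENT_VALUES`, because identical bytes would imply equal values.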

(In most common C implementations today, a flat address space is used, and the bits in a pointer correspond directly to a hardware address. However, it may be that not all bits in a pointer are used for the address. A system might use only 48 bits for addresses in user-space processes, so 64-bit pointers in a C implementation might have 16 spare bits. Different values in those spare bits might not indicate different addresses. They could be used for other purposes or merely neglected by the compiler and allowed to “float.”)

Apparition answered 1/2, 2024 at 14:45 Comment(7)
"In most common C implementations today, a flat address space is used" True. "and the bits in a pointer correspond directly to a hardware address" Not so much. A lot of C programs deal in virtual addresses, and even on "bare metal" implementations, it's common for some of the bits to control cache bypass, etc, in addition to the ones that actually address the memory array. (This latter case is covered by your mention of "used for other purposes", although "spare" isn't a great description at that point)Cindy
As it is tagged language-lawyer, it is a potential UBAbnaki
@BenVoigt: Virtual addresses are hardware addresses, meaning, while a C pointer might be some form of software abstraction, the virtual address is what is used in processor instructions.Apparition
@gulpr: It is emphatically not undefined behavior. Just an indeterminate result.Cindy
AVR programmers would not agree. For them, it is the most important implementation.Abnaki
@gulpr: There is no potential undefined behavior in the given memcmp(&pa, &pb, sizeof(int*)).Apparition
@Abnaki to confirm, I am specifically asking about undefined behavior in the sense of the C standard (as opposed to implementation-defined behavior and unspecified behavior both of which are not relevant here).Dues
P
3

It is undefined behavior in C to perform arithmetic comparison on two unrelated pointers, that is, two pointers that don't point to the same array or object.

More or less, supposing that by "arithmetic comparison" you mean relational expressions (<, <=, >=, or >). There are some additional combinations of operands for which these have defined behavior, however:

When two pointers are compared, [... if] both point one past the last element of the same array object [...;] pointers to structure members [of the same structure object ...;] pointers to members of the same union object [...; when] the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression[s] Q+1 [and] P.

[C17 6.5.8/5]

On the other hand, (in)equality tests between pointers of compatible type are well defined as long as the values are determinate, regardless of what they point to:

The == (equal to) and != (not equal to) operators are analogous to the relational operators except for their lower precedence. [...] For any pair of operands [satisfying the constraints on such expressions], exactly one of the relations is true.

[C17 6.5.9/3, emphasis added]

You go on to say,

One could, however, cast them to uintptr_t:

int a, b;
bool not_ub = (uintptr_t)&a < (uintptr_t)&b;

The cast is defined and the comparison is, too.

Yes, but the significance of such an integer comparison with respect to the original pointers is not defined.
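As a minimal sketch of that point, the helper below (the name `order_by_integer` is hypothetical) orders two pointers by their converted integer values. It assumes the implementation provides the optional type `uintptr_t`. The function always has defined behavior, but which pointer sorts first is decided by the implementation's conversion, not by anything the standard says about the pointed-to objects.

```c
#include <stdint.h>

/* Hypothetical helper: returns -1, 0, or 1 according to the order of the
   integers obtained by converting p and q. The mapping from pointers to
   integers is implementation-defined, so the result says nothing portable
   about where the objects are located. */
int order_by_integer(const void *p, const void *q)
{
    uintptr_t ip = (uintptr_t)p;
    uintptr_t iq = (uintptr_t)q;
    return (ip > iq) - (ip < iq);
}
```

Such an ordering can still be useful, for example as an arbitrary but consistent sort key, as long as nothing is inferred from it about memory layout.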

However, is it UB to compare the two pointers using memcmp?

No. The specifications for memcmp() place no limitations of their own on what their arguments point to, and a restriction such as you postulate would not serve the purposes of the function. Its description simply says:

The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.

A footnote calls out potential problems with comparing structure and union padding, or the contents of char arrays containing strings, past the string terminator, but nowhere in the function description is there any reason to suppose that there are special cases for pointers to any kind of objects, including pointer objects.

And why should there be? memcmp() is not about the semantic values of the objects to which its arguments point. It is about their representations. In that respect, note well that given valid pointers to compatible types p1 and p2, the expression p1 == p2 evaluating to 1 does not imply that memcmp(&p1, &p2, sizeof p1) will return 0. That is, two pointers to the same object can have different representations.

By the same token, nonzero results from memcmp() operating on a pair of pointers have no defined relationship with the relative locations in memory of the objects to which they point (not even under the assumption that the question is meaningful in the execution environment, which is not a given). For a rather pedestrian case, suppose that the environment represents pointers as 64-bit integers conveying indexes into a flat address space. For a given pair of distinct indices, the result from memcmp() would depend on the machine's endianness.
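The endianness effect can be reproduced without relying on the host machine by laying out the bytes by hand. This is a sketch only: `store_u64` and `compare_as_stored` are hypothetical helpers, and the 64-bit values stand in for the flat-address-space indices assumed above.

```c
#include <stdint.h>
#include <string.h>

/* Write v into out[0..7] in the requested byte order. */
static void store_u64(unsigned char out[8], uint64_t v, int big_endian)
{
    for (int i = 0; i < 8; i++) {
        int shift = big_endian ? 8 * (7 - i) : 8 * i;
        out[i] = (unsigned char)(v >> shift);
    }
}

/* memcmp of two indices as they would be stored on a machine with the
   given byte order. */
int compare_as_stored(uint64_t x, uint64_t y, int big_endian)
{
    unsigned char bx[8], by[8];
    store_u64(bx, x, big_endian);
    store_u64(by, y, big_endian);
    return memcmp(bx, by, sizeof bx);
}
```

With x = 0x0100 and y = 0x00FF, the big-endian layout first differs at the second-to-last byte (0x01 against 0x00), so the result is positive. The little-endian layout differs at the first byte (0x00 against 0xFF), so the result is negative, even though x is greater than y in both cases.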

Penninite answered 1/2, 2024 at 16:27 Comment(0)
S
-1

There is no undefined behavior here. You can compare two pointers, regardless of whether they're related or pointing to the same object.

If you get two pointers to the same object, they will be identical. This is a valid method to check whether the two pointers point to the same object. You can not only compare pointers, but do pointer arithmetic with them. If you subtract two pointers and the result is 0, they point to the same object. If you add one to a pointer, it will point just past your object, if you subtract the address of the first (index #0) element of an array from the address of the 5th element (index #4), you get 4. If the pointers are "unrelated", you get a negative number, or a number greater than the size of your array, and then you know that it's invalid. All cases properly defined.

It's true, as mentioned in other answers, that two different pointers can point to the very same object, because there might be unused bits in the pointer, or special flags to control cache and so on. However, this is not a feature of C, this is a peculiarity of the underlying hardware, e.g. partial address decoding. You will never get pointers with different numerical values if you take the address of the same object twice with the & operator. You can modify the address "manually", e.g. to access the same object bypassing the cache, but as far as C is concerned that's a different address and a different object!

Virtual memory does not matter here. C does not know about virtual memory. Physical and virtual memory, memory-mapped I/O are all the same for C. There cannot be two objects at the same address.

Slavophile answered 1/2, 2024 at 18:36 Comment(26)
"If the pointers are "unrelated", you get a negative number, or a number greater than the size of your array, and then you know that it's invalid. All cases properly defined." Are you saying that subtracting two unrelated pointers is not UB?Dues
Yes. Subtracting two pointers cannot possibly be undefined.Slavophile
C11 6.5.6/9 disagrees: "When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object", in other words subtraction of two pointers is defined iff the two pointers point to the same object/array (or one past the end). See also https://mcmap.net/q/940910/-when-is-pointer-subtraction-undefined-in-cDues
True, those are the "valid" values of pointers. You still get a number if you subtract two pointers, but it won't be a valid array index, that's all. Furthermore you can cast any integer into a pointer (e.g, through uintptr_t), without ever having any object at that address, and do arithmetic with them. The standard talks about pointers that are dereferenced. That might be problematic, but doing arithmetic with them cannot be undefined.Slavophile
There are... a few wrong things in here. Again, this whole thing is about what is considered defined or undefined behavior. Casting any integer into a pointer and doing arithmetic with it is something you can do with most compilers, but it will result in undefined behavior: "If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined." I'm not saying it won't give you a number. I'm saying the compiler is free to do anything.Dues
Re “If you subtract two pointers and the result is 0, they point to the same object”: This is false. C 2018 6.5.6 9 says “When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object…” and 4 2 says “If a ‘shall’ or ‘shall not’ requirement that appears outside of a constraint or runtime-constraint is violated, the behavior is undefined…”Apparition
Re “If the pointers are ‘unrelated’, you get a negative number, or a number greater than the size of your array, and then you know that it's invalid. All cases properly defined.”: This also false, per above.Apparition
Re “However, this is not a feature of C, this is a peculiarity of the underlying hardware”: This is false. The final determination of the representations of pointers is made by the C implementation, not the hardware. The C standard does not require a C implementation to use hardware addresses for its pointers. It may use hardware addresses, it may use a modification of the addresses, it may use hardware addresses with additional information, it may use partial hardware addresses (limiting the address space), or it may synthesize its own virtual addresses.Apparition
Re “You can compare two pointers, regardless of whether they're related or pointing to the same object”: This is true for == and != but not for <, <=, >, and >=. C 2018 6.5.8 5 lists various cases for which the relational operators are defined for pointer comparisons, all involving the same object, one past it, or members of the same array, union, or structure, and then it says “In all other cases, the behavior is undefined.”Apparition
zdimension: Since C99 you have uintptr_t and its friends. This implicitly means that from C99 pointers MUST be numbers. Earlier that was not the case, a pointer could have been anything, e.g. a string. It was compiler dependent. Not anymore. Pointers can be "officially" casted into integers, and vice versa, as they're just numbers. As a consequence, the compiler is not free to do anything. It's properly defined, pointer arithmetic is just integer arithmetic.Slavophile
Eric: not sure what you meant. What I described, that you set one bit in the address to access the same memory bypassing the cache, as it's the case on architectures like the TriCore or the now obsolete AVR32B, is completely out of scope for C. Also, some range in the lower addresses, or the entire lower 2G is MMU-translated and might be virtual memory, while the upper addresses are for peripherals, is a HW behavior and out of scope for C.Slavophile
Eric: the undefined behavior comes from the assumption that addresses are "black box" objects, returned by malloc() (for example), and you can't possibly know whether two distinct allocations return lower or higher addresses. But addresses are also integers (typically 32 or 64 bit) that are ALWAYS related in the 32/64 bit address space. You can (and do) have fixed addresses, you can create pointers from arbitrary integers that don't point to any object and do arithmetic with them.Slavophile
@vjalle: Re “Since C99 you have uintptr_t and its friends. This implicitly means that from C99 pointers MUST be numbers.”: No, it does not. It means a pointer must be encodable as a number. It does not mean they have the same semantics as numbers or that operations performed on the numbers will reflect operations on the pointer…Apparition
… For example, a pointer could be a combination of a base b and an offset o, (b, o), such that the physical address is actually 64•b+o, even though the offset o is 16 bits, while the conversion of the pointer to uintptr_t may produce 65536•b+o. Then the pointer (0x100, 0xffff) would represent address 0x13fff but converting it to uintptr_t would produce 0x100ffff. Adding 1 to the uintptr_t would produce 0x1010000, and converting it back to a pointer would produce (0x101, 0x00), which would represent address 0x4040, which is not the byte after address 0x13fff.Apparition
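The arithmetic in this base-and-offset example can be checked with a short model. It is only a sketch of the hypothetical implementation described in the comment: `seg_ptr`, `physical_address`, `to_integer`, and `from_integer` are made-up names, not part of any real compiler.

```c
#include <stdint.h>

/* Model of a pointer made of a base b and a 16-bit offset o. */
struct seg_ptr {
    uint32_t base;
    uint16_t offset;
};

/* Address the hardware would access: 64*b + o. */
uint32_t physical_address(struct seg_ptr p)
{
    return 64u * p.base + p.offset;
}

/* Model of the pointer-to-uintptr_t conversion: 65536*b + o. */
uint32_t to_integer(struct seg_ptr p)
{
    return 65536u * p.base + p.offset;
}

/* Model of the inverse conversion. */
struct seg_ptr from_integer(uint32_t n)
{
    struct seg_ptr p = { n / 65536u, (uint16_t)(n % 65536u) };
    return p;
}
```

Starting from (0x100, 0xffff), `physical_address` gives 0x13fff and `to_integer` gives 0x100ffff. Adding 1 and calling `from_integer` gives (0x101, 0x0000), whose physical address is 0x4040 rather than 0x14000, so the round trip is exact but integer arithmetic does not track byte addresses.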
@vjalle: Re “But addresses are also integers (typically 32 or 64 bit) that are ALWAYS related in the 32/64 bit address space”: No, they are not. Note this question is tagged language-lawyer. That means it is specifically about what the C standard technically says. What C implementations “typically” do is irrelevant. The C standard does not require any relationship between pointers and addresses other than the specific behaviors it describes. Further, even in implementations that use hardware addresses for pointers, optimization by the compiler can break that relationship.Apparition
@EricPostpischil: which C dialect are you talking about? Since C99 pointers are no mysterious black-box objects, they are simple integers in disguise. Before that this was not stated in the C standard despite the fact that pointers were always integers under the hood on all practical architectures. You can consider the entire C memory space as a flat char array that is indexed with an intptr_t. Pointers are just an abstraction that hold both the address and the type. There is nothing mysterious about them.Slavophile
@vjalle: I cited the standard, C 2018. The specification is essentially the same from C 1990 to C 2018: Subtraction of pointers is not defined if the pointers do not point into the same object or array, including one past the end, and relational comparison of pointers is not defined if the pointers do not point into the same object, array, union, or structure, including one past the end. Conversions to uintptr_t, specified since 1999, specify only that conversion back to a pointer yields something equal to the original pointer. No other semantics are guaranteed (C 1999 7.18.1.4 1).Apparition
@vjalle: I have given you specific citations to the C standard that show that these things are not defined and that no guarantees are made that pointers behave like numbers outside the limited uses with the same object, same array, et cetera. You have given no citations to the contrary.Apparition
@EricPostpischil: I know all that, but the fact that C99 states that all pointers can be fully and unambiguously represented by integers has some interesting consequences. Integers are ordered, thus addresses/pointers are ordered too. All arithmetic that works on integers inherently applies to pointers. If something is not directly supported, you can cast to intptr_t, manipulate, cast it back. It's all in the C standard.Slavophile
@vjalle: Re “Integers are ordered, thus addresses/pointers are ordered too”: This is false because the standard does not specify that conversion of a pointer to uintptr_t always produces the same value. It only specifies that conversion back produces a pointer equal to the original…Apparition
… Re “All arithmetic that works on integers inherently applies to pointers”: This is false because the standard does not specify semantics for the integers that result from converting pointers other than that they can be converted back to produce values equal to the original pointers. It does not specify that any manipulations done to the integers will be reflected in the pointers that result from converting the changed integers back. Re “ If something is not directly supported, you can cast to intptr_t, manipulate, cast it back. It's all in the C standard.”: No, that is not in the standard…Apparition
… E.g., it is entirely conforming to the C standard that conversion of a pointer to an integer produces an integer whose bits are all reversed in position from the hardware address. In that case, adding 1 to the integer and then converting back to a pointer would not yield an address one byte beyond the original pointer…Apparition
… The only requirements the standard imposes for the pointer-to-uintptr_t and uintptr_t-to-pointer conversions are “the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer” (C 2018 7.20.1.4 1), that an integer constant zero converts to a null pointer (6.3.2.3 2), and that conversion to an integer type is documented by the implementation (6.3.2.3 7). None of the claims you make appear in the standard.Apparition
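The one guarantee quoted from 7.20.1.4 can be written down as a minimal sketch. The function name `round_trip_compares_equal` is hypothetical, and the code assumes the implementation provides the optional type `uintptr_t`.

```c
#include <stdint.h>

/* Hypothetical helper: converts a valid pointer to uintptr_t and back,
   then reports whether the result compares equal to the original. For a
   valid pointer to void the standard requires this to be true; it makes
   no promise about the integer value itself. */
int round_trip_compares_equal(void *p)
{
    uintptr_t n = (uintptr_t)p; /* implementation-defined integer value */
    void *q = (void *)n;        /* unmodified integer converted back */
    return q == p;
}
```

Nothing in this sketch modifies `n` between the two conversions, which is the only situation the quoted text covers.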
Ok, I understand your concern. In that case we have an implementation-defined function that converts the numerical value of the pointer to the resulting integer. And there is the inverse function, too. You can apply the function, do some math, apply the reverse function, and convert it back to a pointer. Works on the theoretical target that inverts the bits when casting pointers to integers, too.Slavophile
Btw still no UB here only implementation defined, but whatever...Slavophile
@Slavophile "You can apply the function, do some math, apply the reverse function" you specifically can't; integer-to-pointer conversion is defined only if the integer was obtained through pointer-to-integer conversion. Any subsequent modification of the integer renders it ineligible for integer-to-pointer conversion. auto x = (uintptr_t)&thing; auto y = (int*)(x + 4); is thus wrong, because "x + 4" is not an integer that was obtained through conversion. Hence, UBDues

© 2022 - 2025 — McMap. All rights reserved.