Efficient conversion of data from one integer type to another with the same representation

The majority of microcomputer C compilers have two signed integer types with the same size and representation, along with two such unsigned types. If int is 16 bits, its representation will generally match short; if long is 64 bits, it will generally match long long; otherwise, int and long will usually have matching 32-bit representations.

If, on a platform where long, long long, and int64_t have the same representation, one needs to pass a buffer to three API functions in order (assume the APIs are under the control of someone else and use the indicated types; if the functions could readily be changed, they could simply be changed to use the same type throughout):

void fill_array(long *dat, int size);
void munge_array(int64_t *dat, int size);
void output_array(long long *dat, int size);

is there any efficient standard-compliant way of allowing all three functions to use the same buffer without requiring that all of the data be copied between function calls? I doubt the authors of C's aliasing rules intended that such a thing should be difficult, but it is fashionable for "modern" compilers to assume that nothing written via long* will be read via long long*, even when those types have the same representation. Further, while int64_t will generally be the same as either long or long long, implementations are inconsistent as to which.

On compilers that don't aggressively pursue type-based aliasing through function calls, one could simply cast pointers to the proper types, perhaps including a static assertion to ensure that all types have the same size. The problem is that if a compiler like gcc, after expanding out function calls, sees that some storage is written as long and later read as long, without any intervening writes of type long, it may replace the later read with the value written as type long, even if there were intervening writes of type long long.
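A minimal sketch of that cast-plus-static-assertion approach follows. The helper name as_long_long is hypothetical, and the sketch assumes a platform where long and long long have the same size (on others the assertions fail to compile, by design). It does nothing to satisfy the aliasing rules, so it is only suitable for a compiler whose type-based aliasing analysis is disabled or is known not to look through the calls.

```c
#include <assert.h>   /* static_assert (C11) */

/* Compile-time checks that long and long long plausibly share a layout.
   Matching size and alignment is necessary but is not a proof of
   identical representation; the compiler's documentation settles that. */
static_assert(sizeof(long) == sizeof(long long),
              "long and long long differ in size");
static_assert(_Alignof(long) == _Alignof(long long),
              "long and long long differ in alignment");

/* Hypothetical helper: view a long buffer as a long long buffer.
   The conversion needs no machine instructions, but accessing the
   data through the result is still outside what the Standard defines
   unless the compiler is told not to use type-based aliasing. */
static inline long long *as_long_long(long *p)
{
    return (long long *)p;
}
```

With such a helper, a call would look like output_array(as_long_long(buf), size), where buf is the long array that fill_array populated.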

Disabling type-based aliasing altogether is of course one approach to making such code work. Any decent compiler should allow that, and it will avoid many other possible pitfalls. Still, it seems like there should be a Standard-defined way to perform such a task efficiently. Is there?

Washington answered 8/9, 2016 at 16:59 Comment(11)
Perhaps long long datll[size]; fill_array(MY_LLP_LP(datll), size); and let the macro check/handle the conversions?Lavation
@chux: What would you be expecting the macro to do? Bear in mind that zero machine instructions should be necessary to perform the conversions--the only requirement is that it prevent the compiler from "optimizing" later functions to not see the data written by earlier ones.Washington
How common is it to pass a buffer as a pointer to some integer type? Isn't char* or void* much more common? What library are you encountering with this profile? Also, wouldn't it be trivial to wrap the functions with a standard type? (ie: thunking)Sistrunk
Goal is not conversion of an integer type to another integer, but conversion of an integer pointer to another pointer. Suggest a macro to do that with a pointer cast. Use #if processing and _Static_assert where available to ensure the simple cast will suffice.Lavation
@ebyrob, it's not unheard of to pass around buffers of elements of declared type other than character types. It's less common in code that aims to be highly portable, but even in codes such as those, buffers of explicit-width types such as int64_t are sometimes seen. It all depends on what you're trying to represent.Lancelancelet
@chux: Goal is to make integers which were written using one type readable with another pointer type that has the same representation. Both gcc and clang interpret pointer-aliasing rules as allowing compilers to ignore aliasing between different integer types, even when they have the same representation.Washington
@supercat: I'm more than a little hazy on the semantics of the C aliasing rules but I don't suppose there's any chance that a dummy in-place memmove of the buffer followed by casting would be well-defined and optimized by "decent" compilers?Leonardaleonardi
@doynax: If the destination of memcpy or memmove has no declared type, its Effective Type will become that of the source--a rule whose primary effect is to make memcpy/memmove useless for scrubbing effective types. Compilers would probably also be entitled to apply the same effective-type transference to something like for (size_t i=0; i<size; i++) ((char*)buff)[i] = ((char*)buff)[i];, though it's not quite clear what the phrase "through a character array" is really supposed to mean.Washington
@doynax: And incidentally, an attempt to use memmove to copy an object to itself will indeed be stripped out by gcc while leaving the Effective Type of the buffer unchanged.Washington
@supercat: Grumble.. Did the standard body just decide to sit down one day and ask themselves how they might define the language to maximize the number of subtle and insidious bugs through undefined behavior, while creating the absolute minimum number of escape hatches possible and documenting the intended well-defined uses as unclearly as possible? Honestly, how hard would it have been to define some function or specifier to declare intentional type punning to the compiler and reader..Leonardaleonardi
@doynax: C89 is a decent spec if one recognizes that it does not claim to specify everything necessary to make something be a quality implementation for any particular platform, and recognizes that if programmers could expect quality pre-C89 implementations for a given platform to behave a certain way, they should be able to expect quality C89 implementations for that platform to do likewise except when absolutely forbidden (e.g. even if pre-C89 implementations promoted 8-bit unsigned char to 16-bit unsigned int, C89 implementations would be required to promote to signed int).Washington

is there any efficient standard-compliant way of allowing all three functions to use the same buffer without requiring that all of the data be copied between function calls? I doubt the authors of C's aliasing rules intended that such a thing should be difficult, but it is fashionable for "modern" compilers to assume that nothing written via long* will be read via long long*, even when those types have the same representation.

C specifies that long and long long are different types, even if they have the same representation. Regardless of representation, they are not "compatible types" in the sense defined by the standard. Therefore, the strict aliasing rule (C2011 6.5/7) applies: an object having effective type long shall not have its stored value accessed by an lvalue of type long long, and vice versa. Therefore, whatever is the effective type of your buffer, your program exhibits undefined behavior if it accesses elements both as type long and as type long long.
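That the two are distinct types can be observed directly with C11 _Generic, which does not allow two associations naming compatible types. The following is only a sketch; TYPE_NAME and int64_underlying_type are illustrative names, not part of any library.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Illustrative macro: names the type of an expression. It compiles
   only because long and long long are never compatible types;
   _Generic forbids two associations with compatible types. */
#define TYPE_NAME(x) _Generic((x), \
    long: "long",                  \
    long long: "long long",        \
    default: "other")

/* A long expression is not selected by a long long association,
   whatever the sizes of the two types happen to be. */
static_assert(_Generic((long)0, long long: 0, default: 1),
              "long unexpectedly matched long long");

/* Reports which type int64_t is a typedef for on this implementation;
   as the question notes, implementations differ on this. */
static const char *int64_underlying_type(void)
{
    return TYPE_NAME((int64_t)0);
}
```

On a typical implementation int64_underlying_type() returns either "long" or "long long", but never both, which is exactly why at least one of the three API calls in the question ends up with a type mismatch.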

Whereas I concur that the authors of the standard did not intend that what you describe should be hard, they also have no particular intention to make it easy. They are concerned above all with defining program behavior in a way that as much as possible is invariant with respect to all of the freedoms allowed to implementations, and among those freedoms is that long long can have a different representation than does long. Therefore, no program that relies on them having the same representation can be strictly conforming, regardless of the nature or context of that reliance.

Still, it seems like there should be a Standard-defined way to perform such a task efficiently. Is there?

No. The effective type of the buffer is its declared type if it has one, or otherwise is defined by the manner in which its stored value was set. In the latter case, that might change if a different value is written, but any given value has only one effective type. Whatever its effective type is, the strict aliasing rule does not allow for the value to be accessed via lvalues both of type long and of type long long. Period.

Disabling type-based aliasing altogether is of course one approach to making such code work. Any decent compiler should allow that, and it will avoid many other possible pitfalls.

Indeed, that or some other implementation-specific approach, possibly including It Just Works, are your only alternatives for sharing the same data among the three functions you present without copying.
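For comparison, the copying route that the question hopes to avoid is easy to keep within the aliasing rules, because each buffer is only ever accessed through its own type. A sketch, with a made-up function name and no claim that a given compiler will optimize the loop well:

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>
#include <stddef.h>

/* Every long value fits in long long; the assertion merely documents
   why the element-wise conversion below cannot lose information. */
static_assert(LONG_MAX <= LLONG_MAX && LONG_MIN >= LLONG_MIN,
              "long must be representable in long long");

/* Illustrative helper: copy n elements from a long buffer into a
   separate long long buffer. Each element is read as long and stored
   as long long, so no object is accessed through the wrong type and
   no assumption about matching representations is needed. */
static void copy_long_to_long_long(long long *dst, const long *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

The cost is a second buffer and a pass over the data, which is precisely what the question wants to avoid; the sketch is here only to show what the portable baseline looks like.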

Update:

Under some restricted circumstances there may be a somewhat more standard-based solution. For example, with the specific API functions you designated, you could do something like this:

union buffer {
    long       l[BUFFER_SIZE];
    long long ll[BUFFER_SIZE];
    int64_t  i64[BUFFER_SIZE]; 
} my_buffer;

fill_array(my_buffer.l, BUFFER_SIZE);
munge_array(my_buffer.i64, BUFFER_SIZE);
output_array(my_buffer.ll, BUFFER_SIZE);

(Props to @Riley for giving me this idea, though it differs a bit from his.)

Of course that doesn't work if your API dynamically allocates the buffer itself. Note, too, that

  • A program using that approach may conform to the standard, but if it assumes the same representation for long, long long, and int64_t then it still does not strictly conform, as the standard defines that term.

  • The standard is a bit inconsistent on this point. Its remarks about allowing type punning via a union are in a footnote, and the footnotes are non-normative. The reinterpretation described in that footnote seems to contradict paragraph 6.5/7, which is normative. I prefer to keep my mission-critical code far away from uncertainties such as this, for even if we conclude that this approach should work, the uncertainty provides just the kind of cranny that compiler bugs like to lodge in.

  • A rather well-known figure in the field once had this to say about the issue:

Unions are not useful [for aliasing], regardless of what silly language lawyers say, since they are not a generic method. Unions only work for trivial and largely uninteresting cases, and it doesn't matter what C99 says about the issue, since that nasty thing called "real life" interferes.

Lancelancelet answered 8/9, 2016 at 18:33 Comment(16)
I do not dispute that the Standard allows for the possibility that an implementation could document different representations for long and long long, even if both types had the same size. On many implementations, however, the representations of long and long long are documented and they match. The question is whether there is any way to exchange the data without relying upon anything beyond the documented representations.Washington
@supercat, I have answered that question. No. The rest of the answer is a discussion of what parts of the standard yield that conclusion, and of why the standard does not provide a mechanism such as you are looking for.Lancelancelet
So is the only way to make the code portable to write a silly loop which reads each word as one type and then writes the same data back as another, and hope that the compiler manages to omit the instructions which would do the loads and stores, but still reliably recognizes that the effective type has changed? (gcc 6.2 sometimes omits such load/store operations but fails to recognize that the effective type changes.)Washington
@supercat, if you are stuck with a combination of interfaces such as you describe, and you are willing to rely on the representations of long and long long to be the same in your chosen implementation (which presumably you can check in its documentation), then I don't see what's to be gained by avoiding further reliance on your implementation's specific features. With GCC, for instance, I'd consider just casting the pointers and turning on -fno-strict-aliasing if the type representations really did match.Lancelancelet
@Washington Type punning is allowed using a union. See my answerGridley
@Riley: Compilers like gcc will only recognize type punning through a union if the lvalue accesses use the union type directly. Taking the addresses of union members and then using those as pointers to the individual member types won't work.Washington
@Washington gcc doesn't give any errors, and my (basic) test worked properly with the code in my answer (I have the functions just print out the value passed in). What else would be the problem?Gridley
@Riley: See godbolt.org/g/S1k9E9 for a demonstration of gcc 6.2's failure to recognize aliasing of accesses to arrays that are part of a union.Washington
@Washington My assembly is a little rusty. What's the problem?Gridley
@Riley: The code for test3 is pretty simple: return 1 unconditionally, even though it would return 3 if the compiler recognized the aliasing between the storage at p->v1 and p->v2.Washington
@Washington I thought it was weird that it never called blah3. Does it see p->v1 and p->v2 as two different things, so it can optimize away all of the calls because p->v1 is only ever assigned 1?Gridley
@Riley: That's precisely the problem. As far as I'm concerned, gcc's default mode implements a subset of Dennis Ritchie's language, and is unsuitable for any code that will ever need to reuse storage without going through a malloc/free cycle (as of 6.2 it's not reliable if storage gets used as long and then as long long, even if storage is never read using any type other than the one with which it was written).Washington
@JohnBollinger: Did you see the godbolt link? Beyond the fact that such a pattern would require specifying a hard-coded maximum buffer size, gcc 6.2 doesn't recognize the references to l, ll, or i64 as changing the active member of the union.Washington
@JohnBollinger: It may be fair to note, with regard to the Linus Torvalds quote, that he wasn't saying unions are generally useless, but rather that it is generally not practical to encapsulate everything that might alias within an actual union object.Washington
@supercat, with respect to gcc 6.2, then, it seems Linus was right. I have edited my answer a bit to clarify the context of the quotation.Lancelancelet
@JohnBollinger: I like your edits there. Interestingly, there are two ways "real life" intervenes: real-life data formats can often not be mapped to unions, and real compilers like gcc don't always work even when using unions (and even memmove!)Washington

You can try doing it with macros. The sizeof operator is not available to the C preprocessor, but you can compare the limits from <limits.h>, such as UINT_MAX:

#include <limits.h>

#if UINT_MAX == USHRT_MAX
#  define INT_BUFFER ((unsigned*)short_buffer)
#elif UINT_MAX == ULONG_MAX
#  define INT_BUFFER ((unsigned*)long_buffer)
#elif UINT_MAX == ULLONG_MAX
#  define INT_BUFFER ((unsigned*)long_long_buffer)
#else /* Fallback. */
  extern unsigned int_buffer[BUFFER_SIZE];
#  define INT_BUFFER int_buffer
#endif
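Since sizeof cannot appear in #if, one possible complement is to let the preprocessor compare the limits and then have a C11 static assertion confirm the sizes once preprocessing is over. The macro LONG_MATCHES_LLONG is a hypothetical name, and matching limits and sizes only make a representation match plausible; they do not by themselves make aliased access valid.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>

/* Preprocessor-time check: do long and long long cover the same range? */
#if LONG_MAX == LLONG_MAX && LONG_MIN == LLONG_MIN
#  define LONG_MATCHES_LLONG 1
#else
#  define LONG_MATCHES_LLONG 0
#endif

/* sizeof is usable here, after preprocessing. Equal sizes are only
   required when the ranges already matched. */
static_assert(!LONG_MATCHES_LLONG || sizeof(long) == sizeof(long long),
              "ranges match but sizes differ (padding bits?)");
```

Code that depends on the match can then be wrapped in #if LONG_MATCHES_LLONG, with a copying fallback in the #else branch.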

This is a C question, but in C++, you could do this in a fancier way with template specialization and the type trait templates.

Unprincipled answered 8/9, 2016 at 21:44 Comment(4)
The difficulty is that "modern" C compilers will assume that if one function accesses some storage using a pointer of type long* and another accesses storage using a long long*, the functions can't possibly be accessing the same storage, even if the types have the same layout and representation, and even if it should be obvious to the compiler that aliasing would be likely.Washington
@Washington Fair enough, although void* might work for that. The correct way to type-pun like this is with a union anyway.Unprincipled
Using void* doesn't help, since the problem isn't one of ensuring that compilers allow the syntax, but ensuring that they don't use the aliasing rule to justify assumptions that writes to one pointer won't affect the target of another.Washington
I’m pretty sure most compilers can tell that (int*)p and (long*)p are aliases, but a specific example might help. In a single-threaded program, not intermixing aliases might be your solution, and of course a multi-threaded program sharing this data needs a more robust solution anyway.Unprincipled
