Union common initial sequence with primitive

I am trying to better understand a rather surprising discovery regarding unions and the common initial sequence rule. The common initial sequence rule says (class.mem 23):

 In a standard-layout union with an active member of struct type T1, it is permitted to read a non-static data member m of another union member of struct type T2 provided m is part of the common initial sequence of T1 and T2; the behavior is as if the corresponding member of T1 were nominated.

So, given:

struct A {
  int a;
  double x;
};

struct B {
  int b;
};

union U {
  A a;
  B b;
};

U u;
u.a = A{};
int i = u.b.b;

This is defined behavior, and i should have the value 0 (because A and B have a common initial sequence, or CIS, consisting of their first member, an int). So far, so good. The confusing part is what happens if B is replaced by a plain int:

union U {
  A a;
  int b;
};

...
int i = u.b;

According to the definition of common initial sequence:

The common initial sequence of two standard-layout struct types is...

So CISs can only apply between two standard-layout structs. And in turn:

A standard-layout struct is a standard-layout class defined with the class-key struct or the class-key class.

So a primitive type very definitely does not qualify; that is, it cannot have a CIS with anything, so A has no CIS with an int. Therefore the standard says that the first example is defined behavior, but the second is undefined behavior (UB). This does not make any sense to me at all: intuitively, the compiler is at least as restricted with a primitive type as with a class. If this is intentional, is there any rhyme or reason (perhaps alignment related) as to why this makes sense? Is it possibly a defect?
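
The distinction drawn above can be checked with standard type traits. The following is only a sketch, assuming C++11 or later; the helper name read_through_b is made up for this illustration and is not part of the original code. It shows that int is a standard-layout type but not a class type (so it cannot be a standard-layout struct), and it performs the one read that the common initial sequence rule does cover.

```cpp
#include <cassert>
#include <type_traits>

struct A {
  int a;
  double x;
};

struct B {
  int b;
};

union U {
  A a;
  B b;
};

// A and B are standard-layout class types, so the CIS rule can apply to them.
static_assert(std::is_standard_layout<A>::value && std::is_class<A>::value, "");
static_assert(std::is_standard_layout<B>::value && std::is_class<B>::value, "");

// int is a standard-layout type, but it is not a class type, so it is not a
// "standard-layout struct" and cannot take part in a common initial sequence.
static_assert(std::is_standard_layout<int>::value, "");
static_assert(!std::is_class<int>::value, "");

// Hypothetical helper: makes u.a the active member, then reads the leading
// int through u.b, which the common initial sequence rule permits.
inline int read_through_b(int value) {
  U u;
  u.a = A{value, 0.0};
  return u.b.b;
}
```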

Puffin answered 27/4, 2017 at 10:50 Comment(15)
IIRC, notes are non-normative, and just serve to clarify what is already said in other ways in the main text. So, you should take notes' wording with a grain of salt.Mistiemistime
@Mistiemistime I'm on my phone now, but there is a similar statement nearby in the standard that is not a note. I can replace it when at a computer, or am happy for someone to edit the question. Thanks for pointing that out.Puffin
@NirFriedman: "If this is intentional" Define "intentional". This is clearly what the standard says, so that's... what the standard says. Did the committee intend specifically to prevent such things? Who knows, but the common initial sequence rules aren't exactly new.Settles
@NicolBolas I'm going with the dictionary definition of intended. Or if you prefer the negation: is it possible this was an oversight? Frankly this state of affairs is so incredibly bizarre that I personally see no explanation other than oversight (and maybe, lack of caring: type punning is UB but the compilers support it and people use it anyway), but maybe someone else can (and that's who I'm hoping will answer this question).Puffin
This isn't C++-specific. C has the same rule, dating all the way back to C89.Madewell
@NirFriedman: note that the CIS rule is not about type punning, it is intended to provide polymorphism. Type punning in C is provided by the effective type rules and the rules on accessing unions, and in C++ AFAIK is not allowed via unions, but via memcpy. Probably the C89 rationale has the full explanation.Mistiemistime
I doubt the reason is alignment-related. Section 6.7.2.1 of the new C Standard includes this statement: 1424 There may be unnamed padding within a structure object, but not at its beginning. So both struct A { int a; ... } and int should begin with the same memory layout.Celanese
Initial sequence is an attribute of a compound type. Fundamental (aka primitive) types do not have "sequence" and therefore do not have common initial sequence. Anyway, the behavior is perfectly defined in both cases due to 3.10.10.6 and 9.2.20.Benniebenning
@NirFriedman: The C89 rationale notes that it would be possible for an implementation to be conforming but of such poor quality as to be useless, but expects that the easiest way to meet the stated requirements will also naturally meet unstated ones. It thus makes no effort to explicitly define behaviors in all cases where all implementations to date have behaved consistently and there's no reason to expect that implementations might do anything else. If the first member of a struct has the same address as the struct, and all members of a union have the same address,...Baroda
...then in the absence of aliasing rules (which seem to have been a late addition) there was no need to explicitly mandate behavior in the circumstance you describe, since the matching addresses would imply the behavior. What's unfortunate is that the aliasing rules are interpreted as a complete list of all the cases where quality compilers should recognize aliasing, even though there is no evidence that they were intended to revoke useful behavioral guarantees that were implied by other parts of the Standard.Baroda
@Baroda Very interesting observations, thank you.Puffin
@supercat: IIRC, in the beginning of C a valid implementation of a union was to use a struct, as in: put all the members of the union one after another in memory without overlap. So members of a union didn't have to have the same address. Obviously any sane compiler programmer made a union have overlapping members to save memory. The CIS rule makes it clear that this is required: in a union the different members must overlap (at least for the parts that are common).Coliseum
@GoswinvonBrederlow: In C as it existed in 1974, there was no such thing as a union, but any struct tag could be used to access the members of any struct. There was no need to declare union types, since any struct type could be used as a union containing all smaller ones. The CIS guarantee goes back to that time.Baroda
Even the guys on the language standard committee sometimes forget that in C/C++, primitive types are not classes.Hibbard
See also https://mcmap.net/q/661491/-do-scalar-members-in-a-union-count-towards-the-common-initial-sequence/8586227, which merely asks whether this is allowed, not why not.Vogler
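
One of the comments above points to memcpy rather than a union read as the route for type punning in C++. As a rough sketch of what that could look like for the second variant of the question (the function name first_int_via_memcpy is an assumption made for this example), the leading int of the active member can be copied out byte by byte without ever reading the inactive int member:

```cpp
#include <cassert>
#include <cstring>

struct A {
  int a;
  double x;
};

union U {
  A a;
  int b;
};

// Hypothetical helper: copies the first sizeof(int) bytes of the active
// member u.a into a separate int. A is standard-layout, so its first member
// sits at offset 0; u.b itself is never read.
inline int first_int_via_memcpy(int value) {
  U u;
  u.a = A{value, 0.0};
  int i;
  std::memcpy(&i, &u.a, sizeof i);
  return i;
}
```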

ninjalj has it right: the purpose of this rule is to support tagged unions (with tags at the front). While one of the types supported in such a union could be stateless (aside from the tag), this case can be trivially addressed by making a struct containing just the tag. Thus, there is no need for extending the rule beyond structs, and by default such exceptions to undefined behavior (akin to strict aliasing in this case) should be kept to a minimum for the usual reasons of optimization and flexibility of future standardization.
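
A minimal sketch of that workaround, assuming the tag is a plain int stored first in every alternative (the names TagOnly, active_tag, and the two demo functions are invented for this illustration):

```cpp
#include <cassert>

struct A {
  int tag;   // assumed convention: 0 means "A"
  double x;
};

// The stateless alternative is wrapped in a struct holding only the tag,
// so that it shares a common initial sequence with A.
struct TagOnly {
  int tag;   // assumed convention: 1 means "no payload"
};

union U {
  A a;
  TagOnly t;
};

// Reads the tag through u.t whichever member is active; when u.a is active
// this is covered by the common initial sequence of A and TagOnly.
inline int active_tag(const U& u) { return u.t.tag; }

inline int tag_when_a_active() {
  U u;
  u.a = A{0, 2.5};
  return active_tag(u);
}

inline int tag_when_only_tag_active() {
  U u;
  u.t = TagOnly{1};
  return active_tag(u);
}
```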

Vogler answered 9/3, 2018 at 20:10 Comment(3)
Do you have a source? I read ninjalj's comment; I thought it was interesting at the time, but without a source I'm not sure if I can consider it definitive. Also, I don't see any real reason to do the tagged union this way. You can simply do the tagged union as a struct where the first member is the tag, and the second member is a union. This not only avoids the need for CIS, but is cleaner anyhow, and avoids the need for an exception to UB rules.Puffin
I don't have any written rationale to hand, but I would expect (in classic C) to see something like struct A {int tag/*=0*/; /*...*/}; struct B {int tag/*=1*/; /* ... */}; union U {struct A a; struct B b;}; void f(A*); void g(U *u) {if(u->a.tag==0) f(u); /*...*/}, with an implicit (A*)u if not simply u->tag! The rule was written, I believe, to cover such cases that predated it, even though other strategies could exist.Vogler
I don't see any reason why you would choose to use that over struct a_or_b { int tag; union { A a; B b; } value; } or something similar. Even if one or the other is slightly better stylistically, your own argument heavily weighs minimizing exceptions to UB, which would seem to outweigh any mild stylistic concerns. At any rate I'll +1 for making me aware of this.Puffin
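
For comparison, the alternative layout named in the last comment can be sketched as follows. Only a_or_b, tag, and value come from the comment; the payload structs, the tag values, and the helper functions are assumptions added to make the example self-contained. Because the tag lives outside the union, no read ever goes through an inactive member and the common initial sequence rule is not needed.

```cpp
#include <cassert>

struct A {
  double x;
};

struct B {
  int n;
};

struct a_or_b {
  int tag;  // assumed convention: 0 means value.a is active, 1 means value.b
  union {
    A a;
    B b;
  } value;
};

// Hypothetical helpers that keep the tag and the active member in sync.
inline void set_a(a_or_b& v, double x) {
  v.tag = 0;
  v.value.a = A{x};
}

inline void set_b(a_or_b& v, int n) {
  v.tag = 1;
  v.value.b = B{n};
}

// Dispatches on the tag and only ever reads the active member.
inline double as_double(const a_or_b& v) {
  return v.tag == 0 ? v.value.a.x : static_cast<double>(v.value.b.n);
}
```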
