c pointers language-lawyer unions strict-aliasing

How do initial members, common initial sequences, anonymous unions, and strict aliasing interact in C?


There are several things that are clearly allowed under the strict aliasing rules (for clarity, let's do this in C23):

The first and most obvious is that structs are allowed to alias with pointers to their initial members:

typedef struct {
  int data;
} parent;

typedef struct {
  parent _base;
  int metadata;
} child;

int main() {
  child child_obj = {};
  parent* parent_ptr = (parent*) &child_obj; // Fine to read and write
  int* int_ptr = (int*) parent_ptr; // Also fine
}

The second is that a union of structs sharing a common initial sequence allows those common members to be accessed through any of the structs, and a pointer to a union is freely convertible to (and accessible as) a pointer to any of its members.

#include <stddef.h> // size_t
#include <string.h> // memcpy

typedef struct {
  int type_id;
  size_t size;
  char* buffer;
} dynamic_string;

typedef struct {
  int type_id;
  size_t number;
} just_a_number;

typedef union {
  dynamic_string dystr;
  just_a_number num;
  struct {
    int type_id;
    char fixed_string[32];
  };
} string_or_num;

int main() {
  string_or_num obj = {};

  if(obj.type_id == 2) // Fine, common initial sequence
    memcpy(obj.fixed_string, "Hello World!", 13); // destination first, then source

  // Fine
  dynamic_string* ds_ptr = (dynamic_string*) &obj;
  just_a_number* num_ptr = (just_a_number*) &obj;

  // Also fine, pointer to initial common member
  int* int_ptr = (int*) &obj;
}

Intuitively, I think I can combine these into something like the following. However, I'm not confident enough in my standardese to say with 100% certainty that it is kosher:

typedef struct {
  int type_id;
  char data[4];
} parent_a;

typedef struct {
  int type_id;
  float decimal;
} parent_b;

// No initial sequence
typedef struct {
  double ccccombo_breaker;
} parent_c;

typedef struct {
  union {
    parent_a _base_a;
    parent_b _base_b;
    parent_c _base_c;
  };
  int look_at_me_ive_got_three_parents;
} child;

int main() {
  child child_obj = {};

  // Are these kosher?
  parent_a* a_ptr = (parent_a*) &child_obj;
  parent_b* b_ptr = (parent_b*) &child_obj;

  // How about this?
  parent_c* c_ptr = (parent_c*) &child_obj;
  double* db_ptr = (double*) &child_obj;
}

To be clear, I'm not asking whether something like parent_c is a good idea, just what the standard says about it. Would reads and writes through these pointers follow the aliasing rules?

Bonus points if you have exact language from the standard or a combination of standard sections that make a compelling case.


Solution

  • These are separate but mostly compatible rules:

    Additionally, there's the rule about "union type punning" (6.5.2.3 §3) which allows a member of a union to be converted/expressed as a different type, although this may invoke all manner of poorly-defined behavior in case of misalignment, out of range values, invalid/trap representations and so on.
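As a rough illustration of the union type punning mentioned above, here is a minimal sketch. The names `float_bits` and `float_to_bits` are hypothetical, and the code assumes that `float` is a 32-bit IEEE 754 type, which the standard does not guarantee.

```c
#include <stdint.h>

/* Sketch only: assumes float and uint32_t have the same size. */
_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

/* Hypothetical union used purely for illustration. */
typedef union {
  float    as_float;
  uint32_t as_bits;
} float_bits;

/* Store through one member, then read through the other.
   The stored bytes are reinterpreted as the type of the member read. */
uint32_t float_to_bits(float value) {
  float_bits pun;
  pun.as_float = value;
  return pun.as_bits;
}
```

Going the other way (bits to `float`) is where the caveats about invalid or trap representations would start to matter, since not every bit pattern has to be a valid value of the target type.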

    Your question is mainly about the first of these rules:

    parent_a* a_ptr = (parent_a*) &child_obj;
    parent_b* b_ptr = (parent_b*) &child_obj;
    parent_c* c_ptr = (parent_c*) &child_obj;
    

    These are fine as per that "any member of union" rule. A pointer to child_obj, suitably converted, points to its initial member, which is the anonymous union, and a pointer to a union, suitably converted, points to each of its members. Each of _base_a, _base_b and _base_c therefore counts as "any member of a union", and the cast is what "suitably converts" the pointer. That some of these types happen to share a common initial sequence further down isn't really relevant; we can have a union of wildly incompatible types. Potential problems can only arise when accessing the actual data through a potentially non-compatible type or a type which is not an alias.
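The conversions discussed above can be sketched as follows, reusing the types from the question. The helper names `set_type_id_via_parent_a` and `converted_pointers_match` are hypothetical and exist only to make the point checkable.

```c
typedef struct { int type_id; char data[4]; } parent_a;
typedef struct { int type_id; float decimal; } parent_b;
typedef struct { double ccccombo_breaker; } parent_c;

typedef struct {
  union {
    parent_a _base_a;
    parent_b _base_b;
    parent_c _base_c;
  };
  int look_at_me_ive_got_three_parents;
} child;

/* Hypothetical helper: write through the converted pointer.
   The chain is child -> anonymous union (initial member) -> _base_a. */
void set_type_id_via_parent_a(child* obj, int id) {
  parent_a* a_ptr = (parent_a*) obj;
  a_ptr->type_id = id;
}

/* Hypothetical helper: returns 1 if each converted pointer compares
   equal to the address of the corresponding union member. */
int converted_pointers_match(child* obj) {
  return (void*) (parent_a*) obj == (void*) &obj->_base_a
      && (void*) (parent_b*) obj == (void*) &obj->_base_b
      && (void*) (parent_c*) obj == (void*) &obj->_base_c;
}
```

After `set_type_id_via_parent_a(&c, 7)`, reading `c._base_a.type_id` gives back 7, since both expressions designate the same `int` object.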

    double* db_ptr = (double*) &child_obj; is, however, a bit questionable, since the first object of child_obj is not a double. The pointer conversion itself is almost always fine: C allows pretty much any crazy conversion between object pointers (6.3.2.3 §7).

    But if you de-reference db_ptr later, you are on more questionable territory: the "pointer to any member of union" rule doesn't apply, so it becomes a question of strict aliasing. Strict aliasing, in turn, does not object to a double lvalue access to something that is potentially a double. And if the binary contents stored there (all zeroes) can also be represented as a double, then everything is in theory fine.
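To make the distinction concrete, here is a small sketch contrasting the direct `double` lvalue access with a more conservative byte copy. The `child` type is a trimmed version of the one in the question, and both function names are hypothetical. Whether a given compiler treats the direct access the way the standard seems to allow is exactly the open question, so treat this as an illustration rather than a guarantee.

```c
#include <string.h>

typedef struct { int type_id; char data[4]; } parent_a;
typedef struct { double ccccombo_breaker; } parent_c;

/* Trimmed-down version of the question's child type. */
typedef struct {
  union {
    parent_a _base_a;
    parent_c _base_c;
  };
  int extra;
} child;

/* Direct access: relies on the storage at the start of the object
   being accessible as a double under the aliasing rules. */
double read_double_direct(child* obj) {
  double* db_ptr = (double*) obj;
  return *db_ptr;
}

/* Conservative alternative: copy the bytes into a real double object,
   which sidesteps the aliasing question entirely. */
double read_double_memcpy(const child* obj) {
  double out;
  memcpy(&out, obj, sizeof out);
  return out;
}
```

The `memcpy` variant costs nothing on mainstream optimizing compilers in practice and does not depend on how the implementation interprets the aliasing rules.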

    Notably, the history of real-world compiler implementations of these rules isn't very pretty (particularly strict aliasing and common initial sequence). Lots of things are left unclear by the standard, and it is better not to rely on whatever the standard says or seems to say, because that's not necessarily how one particular compiler interprets it. Plus, some compilers do not even have the ambition to be a quality implementation. It is best practice not to trust the compiler to get any of this right. For example, gcc 13.2 (the latest release at the time of writing) goes completely bananas when facing the mentioned "inspect byte by byte" rule, which has been in C since at least C99.
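For reference, the kind of access the "inspect byte by byte" rule describes looks roughly like the sketch below: the object's address is converted to a character pointer and its representation is read one byte at a time. The function name `count_nonzero_bytes` is hypothetical, and this is not a reproduction of the gcc behaviour mentioned above.

```c
#include <stddef.h>

/* Hypothetical helper: walk the object representation through an
   unsigned char pointer and count the bytes that are not zero. */
size_t count_nonzero_bytes(const void* obj, size_t size) {
  const unsigned char* bytes = (const unsigned char*) obj;
  size_t count = 0;
  for (size_t i = 0; i < size; i++) {
    if (bytes[i] != 0) {
      count++;
    }
  }
  return count;
}
```

A call such as `count_nonzero_bytes(&child_obj, sizeof child_obj)` inspects any object this way, whatever its declared type.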