Consider the following setup:
typedef struct
{
    float d;
} InnerStruct;

typedef struct
{
    InnerStruct **c;
} OuterStruct;

float TestFunc(OuterStruct *b)
{
    float a = 0.0f;
    for (int i = 0; i < 8; i++)
        a += b->c[i]->d;
    return a;
}
The for loop in TestFunc exactly replicates one in another function that I'm testing. Both loops are unrolled by gcc (4.9.2) but yield slightly different assembly after doing so.
Assembly for my test loop:    Assembly for the original loop:
lwz r9,-0x725C(r13)           lwz r9,0x4(r3)
lwz r8,0x4(r9)                lwz r8,0x8(r9)
lwz r10,0x0(r9)               lwz r10,0x4(r9)
lwz r11,0x8(r9)               lwz r11,0x0C(r9)
lwz r4,0x4(r8)                lwz r3,0x4(r8)
lwz r10,0x4(r10)              lwz r10,0x4(r10)
lwz r8,0x4(r11)               lwz r0,0x4(r11)
lwz r11,0x0C(r9)              lwz r11,0x10(r9)
efsadd r4,r4,r10              efsadd r3,r3,r10
lwz r10,0x10(r9)              lwz r8,0x14(r9)
lwz r7,0x4(r11)               lwz r10,0x4(r11)
lwz r11,0x14(r9)              lwz r11,0x18(r9)
efsadd r4,r4,r8               efsadd r3,r3,r0
lwz r8,0x4(r10)               lwz r0,0x4(r8)
lwz r10,0x4(r11)              lwz r8,0x0(r9)
lwz r11,0x18(r9)              lwz r11,0x4(r11)
efsadd r4,r4,r7               efsadd r3,r3,r10
lwz r9,0x1C(r9)               lwz r10,0x1C(r9)
lwz r11,0x4(r11)              lwz r9,0x4(r8)
lwz r9,0x4(r9)                efsadd r3,r3,r0
efsadd r4,r4,r8               lwz r0,0x4(r10)
efsadd r4,r4,r10              efsadd r3,r3,r11
efsadd r4,r4,r11              efsadd r3,r3,r9
efsadd r4,r4,r9               efsadd r3,r3,r0
The issue is that the float values these two instruction sequences return are not exactly the same, and I can't change the original loop. I need to modify the test loop somehow so that it returns the same values. I believe the test loop's assembly is equivalent to simply adding the elements one after another. I'm not very familiar with assembly, so I'm not sure how the differences above translate into C. I know the differing unrolled code is the cause because if I add a print to the loops, they are no longer unrolled and the results match exactly, as expected.
Disabling fast-math seems to fix this issue. Thanks to @njuffa for the suggestion. I was hoping to design the test function around this optimization, but that doesn't seem to be possible. At least I now know what the issue is. I appreciate everyone's help with the problem!
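To illustrate why fast-math can make the two loops disagree, here is a minimal sketch of the effect. Float addition is not associative, and fast-math allows the compiler to reorder the additions. Reading the two listings by hand (my interpretation, not something verified on the target), the test loop appears to add the elements in index order 1, 0, 2, 3, 4, 5, 6, 7, while the original loop appears to add them in order 2, 1, 3, 4, 5, 6, 0, 7. The helper sum_in_order, the two order tables, and the SAMPLE values are hypothetical names and data made up for this illustration. It should be compiled without fast-math so the compiler keeps the written order.

```c
#include <stddef.h>

/* Hypothetical helper: adds v[order[0]], v[order[1]], ... strictly left to
 * right. The volatile accumulator forces every partial sum to be rounded to
 * float, so the result depends only on the order of the additions. */
float sum_in_order(const float *v, const int *order, size_t n)
{
    volatile float acc = v[order[0]];
    for (size_t i = 1; i < n; i++)
        acc = acc + v[order[i]];
    return acc;
}

/* Addition orders as read from the two assembly listings (an assumption). */
static const int TEST_ORDER[8] = {1, 0, 2, 3, 4, 5, 6, 7};
static const int ORIG_ORDER[8] = {2, 1, 3, 4, 5, 6, 0, 7};

/* Made-up sample data: one large value and seven small ones. Near 1e8 the
 * spacing between adjacent floats is 8, so adding 1.0f at a time is lost,
 * while adding the small values together first is not. */
static const float SAMPLE[8] = {1e8f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
```

With this data, sum_in_order(SAMPLE, TEST_ORDER, 8) gives 100000000 and sum_in_order(SAMPLE, ORIG_ORDER, 8) gives 100000008. The real data presumably differs by much less, but the mechanism is the same. Once the compiler is free to reassociate, the C source alone no longer determines the order, which is why matching the original loop from the test side is not reliable.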