Why doesn't this code produce the same assembly? (g++ -O3)
I know little of assembly but it seems case 2 accessing has less instructions, so should be preferred, right? I am asking this because I wanted to implement a wrapper class with an access operator that returns a pointer int* p = a[i]
(so access is a[i][j]
, instead of a[i*3+j]
), but don't know if it's worth it. Thank you for any help.
#include <iostream>
int main() {
int a[9];
int i, j, k;
// Case 1
std::cin >> i >> j >> k;
*(a + i*3 + j) = k;
std::cin >> i >> j >> k;
(&a[i*3])[j] = k;
std::cin >> i >> j >> k;
*((&a[i*3])+j) = k;
// Case 2
std::cin >> i >> j >> k;
a[i*3 + j] = k;
std::cout << a[0];
return 0;
}
https://godbolt.org/z/13arxcPqz
Edit: For completeness, this change where a
is moved to the right is exactly as in case 2, as the operator+ now associates left.
// Case 2 again
std::cin >> i >> j >> k;
*(i*3 + j + a) = k;
The expressions *(a + i*3 + j)
and a[i*3 + j]
are not equivalent at the level of C++. Since binary +
associates left-to-right, the former is equivalent to *((a + i*3) + j)
while the latter is equivalent to *(a + (i*3 + j))
. They can produce different results if, for instance, the sum in i*3 + j
would overflow int
.
For a concrete example, consider a 64-bit machine with 32-bit int
like your x86-64 system, and suppose we had i == 600'000'000
and j == 2'000'000'000
. Suppose, instead of your array of length 9, that a
points into an extremely big array on a 64-bit. The first expression adds 1'800'000'000
and then 2'000'000'000
to a
, yielding a+3'800'000'000
. The second adds 1'800'000'000+2'000'000'000
first, which overflows and causes undefined behavior. On some compilers, the behavior might be to "wrap around", yielding a+(-494'967'296)
, a completely different address that is 16 GB away from the other one.
The generated assembly reflects this distinction. In the second case, the addition i*3 + j
is done as plain 32-bit addition, which would wrap around on overflow. Since j
is in memory, once we get i
in a register, we can use a plain add r32, m32
instruction to do the addition. But in the first case, i*3 + j
must be done as a 64-bit addition to yield correct pointer arithmetic. So j
must be sign-extended to 64 bits before adding, and this cannot be done in a single memory-source add instruction. Instead, we first use movsx r64, m32
to load j
into a register with sign extension, then add r64, r64
to do the 64-bit addition. This explains why it takes an extra instruction.
Which of the two "should be preferred" is less about the efficiency and more about whether your code could conceivably be called with arguments that would overflow, and what you want to have happen in that situation. Worry about correct behavior before optimizing.
Just to highlight the code I'm talking about: *(a + i*3 + j) = k;
is performed at lines 12-13 and 16-20 in the asm code linked in the question:
mov eax, DWORD PTR [rsp+4] ; eax = i, zero-extend
movsx rdx, DWORD PTR [rsp+8] ; rdx = (int64_t)j, sign-extend to 64 bits
;;; lea rsi, [rsp+4] ; unrelated, set up args for next cin
;;; mov edi, OFFSET FLAT:_ZSt3cin ; unrelated, set up args for next cin
lea eax, [rax+rax*2] ; eax = i*3, still 32 bits
cdqe ; rax = (int64_t)i*3, sign-extended
add rax, rdx ; rax = (int64_t)(i*3) + (int64_t)j
mov edx, DWORD PTR [rsp+12] ; edx = k
mov DWORD PTR [rsp+16+rax*4], edx ; perform the store
Then the code for the next two versions, (&a[i*3])[j] = k;
(28-29 and 30-36) and *((&a[i*3])+j) = k;
(44-45 and 48-52) is the same; these also correspond to two "pointer plus index" steps and never do the int
addition.
Whereas a[i*3 + j] = k;
is at lines 60-65:
mov eax, DWORD PTR [rsp+4] ; eax = i
mov edx, DWORD PTR [rsp+12] ; edx = k
lea eax, [rax+rax*2] ; eax *= 3
add eax, DWORD PTR [rsp+8] ; eax += j (32 bit add!)
cdqe ; rax = (int64_t)(i*3+j)
mov DWORD PTR [rsp+16+rax*4], edx ; do the store