c++assembly c++20 pointer-arithmetic arrayaccess

Why aren't array access and pointer arithmetic equivalent with full optimization?

Why doesn't this code produce the same assembly? (g++ -O3) I know little of assembly but it seems case 2 accessing has less instructions, so should be preferred, right? I am asking this because I wanted to implement a wrapper class with an access operator that returns a pointer int* p = a[i] (so access is a[i][j], instead of a[i*3+j]), but don't know if it's worth it. Thank you for any help.

#include <iostream>

int main() {
    
    int a[9];
    int i, j, k;

    // Case 1
    std::cin >> i >> j >> k;
    *(a + i*3 + j) = k;

    std::cin >> i >> j >> k;
    (&a[i*3])[j] = k;

    std::cin >> i >> j >> k;
    *((&a[i*3])+j) = k;

    // Case 2
    std::cin >> i >> j >> k;
    a[i*3 + j] = k;

    std::cout << a[0];

    return 0;
}

https://godbolt.org/z/13arxcPqz

Edit: For completeness, this change where a is moved to the right is exactly as in case 2, as the operator+ now associates left.

// Case 2 again
std::cin >> i >> j >> k;
        *(i*3 + j + a) = k;

https://godbolt.org/z/x89453aK4

Solution

The expressions *(a + i*3 + j) and a[i*3 + j] are not equivalent at the level of C++. Since binary + associates left-to-right, the former is equivalent to *((a + i*3) + j) while the latter is equivalent to *(a + (i*3 + j)). They can produce different results if, for instance, the sum in i*3 + j would overflow int.

For a concrete example, consider a 64-bit machine with 32-bit int like your x86-64 system, and suppose we had i == 600'000'000 and j == 2'000'000'000. Suppose, instead of your array of length 9, that a points into an extremely big array on a 64-bit. The first expression adds 1'800'000'000 and then 2'000'000'000 to a, yielding a+3'800'000'000. The second adds 1'800'000'000+2'000'000'000 first, which overflows and causes undefined behavior. On some compilers, the behavior might be to "wrap around", yielding a+(-494'967'296), a completely different address that is 16 GB away from the other one.

The generated assembly reflects this distinction. In the second case, the addition i*3 + j is done as plain 32-bit addition, which would wrap around on overflow. Since j is in memory, once we get i in a register, we can use a plain add r32, m32 instruction to do the addition. But in the first case, i*3 + j must be done as a 64-bit addition to yield correct pointer arithmetic. So j must be sign-extended to 64 bits before adding, and this cannot be done in a single memory-source add instruction. Instead, we first use movsx r64, m32 to load j into a register with sign extension, then add r64, r64 to do the 64-bit addition. This explains why it takes an extra instruction.

Which of the two "should be preferred" is less about the efficiency and more about whether your code could conceivably be called with arguments that would overflow, and what you want to have happen in that situation. Worry about correct behavior before optimizing.

Just to highlight the code I'm talking about: *(a + i*3 + j) = k; is performed at lines 12-13 and 16-20 in the asm code linked in the question:

        mov     eax, DWORD PTR [rsp+4]        ; eax = i, zero-extend
        movsx   rdx, DWORD PTR [rsp+8]        ; rdx = (int64_t)j, sign-extend to 64 bits
        ;;; lea     rsi, [rsp+4]              ; unrelated, set up args for next cin
        ;;; mov     edi, OFFSET FLAT:_ZSt3cin ; unrelated, set up args for next cin
        lea     eax, [rax+rax*2]              ; eax = i*3, still 32 bits
        cdqe                                  ; rax = (int64_t)i*3, sign-extended
        add     rax, rdx                      ; rax = (int64_t)(i*3) + (int64_t)j
        mov     edx, DWORD PTR [rsp+12]       ; edx = k
        mov     DWORD PTR [rsp+16+rax*4], edx ; perform the store

Then the code for the next two versions, (&a[i*3])[j] = k; (28-29 and 30-36) and *((&a[i*3])+j) = k; (44-45 and 48-52) is the same; these also correspond to two "pointer plus index" steps and never do the int addition.

Whereas a[i*3 + j] = k; is at lines 60-65:

        mov     eax, DWORD PTR [rsp+4]        ; eax = i
        mov     edx, DWORD PTR [rsp+12]       ; edx = k
        lea     eax, [rax+rax*2]              ; eax *= 3
        add     eax, DWORD PTR [rsp+8]        ; eax += j (32 bit add!)
        cdqe                                  ; rax = (int64_t)(i*3+j)
        mov     DWORD PTR [rsp+16+rax*4], edx ; do the store