c++compilationx86-64micro-architecturememory-footprint

Can programs use (significantly) less memory when compiled for different processors?


I have a C++ program I'm compiling for AMD64. Of course, different processors, despite being AMD64, support different features and instructions because they implement different microarchitectures. An easy way to optimise the program for one's own machine is to just use -march=native in Clang or GCC, but this isn't very portable for distribution's sake. A more portable solution would be to pick and choose specific target features.

This obviously affects performance (some processors support AVX-512, some don't, some support AVX2, some don't, etc.), but can this affect memory usage (heap/stack, not code size) in any significant way?


Solution

  • Different alignment rules or type widths are the two main ways you could get a difference, but -march= doesn't change that, not when compiling for the same ABI on the same ISA. (Otherwise -march=skylake-avx512 code couldn't call -march=sandybridge code and vice versa, if they disagreed on struct layouts.)

    Compiling for a different ABI can save space especially in pointer-heavy data structures. Specifically an ILP32 ABI such as Linux x32 has 4 byte pointers instead of 8, so struct foo { foo *next; int val; }; is 8 bytes instead of 16 (after padding to make sizeof(foo) a multiple of the alignof(foo) it inherits from pointers needing 8-byte alignment). But that won't work for your use-case of 100GB of data; 32-bit pointers limit you to 4GiB of address space.


    -march= could have some small effect on stack space when auto-vectorizing. e.g. a function might align the stack by 64 in order to spill/reload a ZMM vector. Or with older GCC, align even if the final asm doesn't actually store or load any vectors to the stack frame. But that's at most an extra 56 bytes of wasted stack space per level of function nesting, vs. 16-byte alignment which can be had for free as part of the calling convention.

    GCC / clang's optimizers won't AFAIK do any optimizations that change the size of dynamic allocations. Clang can sometimes optimize away a dynamic allocation entirely in a function that for example creates and destroys a std::vector<float> foo(100); and all accesses to it can be optimized away. (e.g. store constants into the vector and then read them back, it can just optimize that away then eliminate the allocation, too. Or a std::vector that isn't even used.)

    Possibly a different allocator library that's better at reducing internal fragmentation could save space, if you end up with some memory pages allocated but not fully used. But that's not something -march= affects.