Tags: c++, assembly, compiler-optimization, self-modifying

Can compilers generate self-modifying code?


It is commonly said that a static variable initialization is wrapped in an if to prevent it from being initialized multiple times.

For this, and other one-off conditions, it would be more efficient to have the code remove the conditional after the first pass via self modification.

Are C++ compilers allowed to generate such code and if not, why? I heard that it might have a negative impact on the cache, but I don’t know the details.


Solution

  • Yes, that would be legal. ISO C++ makes zero guarantees about being able to access data (machine code) through function pointers cast to unsigned char*. On most real implementations it's well defined, except on pure-Harvard machines where code and data have separate address spaces.

    Hot-patching (usually by external tools) is a thing, and is very doable if compilers generate code to make that easy, i.e. the function starts with a long-enough instruction that can be atomically replaced.

    As Ross points out, a major obstacle to self-modification on most C++ implementations is that they target OSes which normally map executable pages read-only. W^X is an important security feature to avoid code injection. Only for very long-running programs with very hot code paths would it be worth it overall to make the necessary system calls to make the page read+write+exec temporarily, atomically modify an instruction, then flip it back.

    And impossible on systems like OpenBSD that truly enforce W^X, not letting a process mprotect a page with both PROT_WRITE and PROT_EXEC. Making a page temporarily non-executable doesn't work if other threads can call the function at any moment.

    It is commonly said that a static variable initialization is wrapped in an if to prevent it from being initialized multiple times.

    Only for non-constant initializers, and of course only for static locals. A local like static int foo = 1; will compile the same as at global scope, to a .long 1 (GCC for x86, GAS syntax) with a label on it.

    But yes, with a non-constant initializer, compilers will invent a guard variable they can test. They arrange things so that checking the guard is read-only on the fast path (not like taking a readers/writers lock), but it still costs a couple of extra instructions.

    e.g.

    int init();
    
    int foo() {
        static int counter = init();
        return ++counter;
    }
    

    compiled with GCC 10.2 -O3 for x86-64:

    foo():             # with demangled symbol names
            movzx   eax, BYTE PTR guard variable for foo()::counter[rip]
            test    al, al
            je      .L16
            mov     eax, DWORD PTR foo()::counter[rip]
            add     eax, 1
            mov     DWORD PTR foo()::counter[rip], eax
            ret
    
    .L16:  # slow path
        # acquire lock; one thread does the init while the others wait
    

    So the fast path check costs 2 uops on mainstream CPUs: one zero-extending byte load, one macro-fused test-and-branch (test + je) that's not-taken. But yes, it has non-zero code-size for both L1i cache and decoded-uop cache, and non-zero cost to issue through the front-end. And an extra byte of static data that has to stay hot in cache for good performance.

    Normally inlining makes this negligible. If you're actually calling a function with this at the start often enough to matter, the rest of the call/ret overhead is a larger problem.

    But things aren't so nice on ISAs without cheap acquire loads (e.g. ARM before ARMv8). Instead of somehow arranging to barrier() all threads once after initializing the static variable, every check of the guard variable is an acquire load. On ARMv7 and earlier, that's done with a full memory barrier, dmb ish (data memory barrier: inner shareable), which includes draining the store buffer, exactly the same as for atomic_thread_fence(mo_seq_cst). (ARMv8 has ldar (word) / ldarb (byte) to do acquire loads, making them nice and cheap.)

    Godbolt with ARMv7 clang

    # ARM 32-bit clang 10.0 -O3 -mcpu=cortex-a15
    # GCC output is even more verbose because of Cortex-A15 tuning choices.
    foo():
            push    {r4, r5, r11, lr}
            add     r11, sp, #8
            ldr     r5, .LCPI0_0           @ load a PC-relative offset to the guard var
    .LPC0_0:
            add     r5, pc, r5
            ldrb    r0, [r5, #4]           @ load the guard var
            dmb     ish                    @ full barrier, making it an acquire load
            tst     r0, #1
            beq     .LBB0_2                @ go to slow path if low bit of guard var == 0
    .LBB0_1:
            ldr     r0, .LCPI0_1           @ PC-relative load of a PC-relative offset
    .LPC0_1:
            ldr     r0, [pc, r0]           @ load counter
            add     r0, r0, #1             @ ++counter leaving value in return value reg
            str     r0, [r5]               @ store back to memory, IDK why a different addressing mode than the load.  Probably a missed optimization.
            pop     {r4, r5, r11, pc}      @ return by popping saved LR into PC
    

    But just for fun, let's look at exactly how your idea could be implemented.

    Assuming you can PROT_WRITE|PROT_EXEC (to use POSIX terminology) a page containing the code, it's not a hard problem to solve for most ISAs, such as x86.

    Start the function with jmp rel32 or whatever to a "cold" section of code that does mutual exclusion to run the non-constant static initializer in one thread. (So if you do have multiple threads start to run it before one finishes and modifies the code, it all works the way it does now.)

    Once construction is fully done, use an 8-byte atomic CAS or store to replace that 5-byte instruction with different instruction bytes. Possibly just a NOP, or possibly something useful that was done at the top of the "cold" code.

    Or on a non-x86 ISA with fixed-width instructions, if the instruction width matches a store size the machine can do atomically, a single word store can replace the one jump instruction.