gccx86inline-assemblymachine-codegem5

which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension


I am modelling a custom MOV instruction in the X86 architecture in the gem5 simulator, to test its implementation on the simulator, I need to compile my C code using inline assembly to create a binary file. But since it a custom instruction which has not been implemented in the GCC compiler, the compiler will throw out an error. I know one way is to extend the GCC compiler to accept my custom X86 instruction, but I do not want to do it as it is more time consuming(but will do it afterwards).

As a temporary hack (just to check if my implementation is worth it or not). I want to edit an already MOV instruction while changing its underlying "micro ops" in the simulator so as to trick the GCC to accept my "custom" instruction and compile.

As they are many types of MOV instructions which are available in the x86 architecture. As they are various MOV Instructions in the 86 architecture reference.

Therefore coming to my question, which MOV instruction is the least used and that I can edit its underlying micro-ops. Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers and my instructions mirrors the same implementation of a MOV instruction.


Solution

  • Your best bet is regular mov with a prefix that GCC will never emit on its own. i.e. create a new mov encoding that includes a mandatory prefix in front of any other mov. Like how lzcnt is rep bsr.

    Or if you're modifying GCC and as, you can add a new mnemonic that just uses otherwise-invalid (in 64-bit mode) single byte opcodes for memory-source, memory-dest, and immediate-source versions of mov. AMD64 freed up several opcodes, including the BCD instructions like AAM, and push/pop most segment registers. (x86-64 can still mov to/from Sregs, but there's just 1 opcode per direction, not 2 per Sreg for push ds/pop ds etc.)

    Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers

    Bad assumption for XMM: GCC aggressively uses 16-byte movaps / movups instead of copying structs 4 or 8 bytes at a time. It's not at all rare to find vector mov instructions in scalar integer code as part of inline expansion of small known-length memcpy or struct / array init. Also, those mov instructions have at least 2-byte opcodes (SSE1 0F 28 movaps, so a prefix in front of plain mov is the same size as your idea would have been).

    However, you're right about MMX regs. I don't think modern GCC will ever emit movq mm0, mm1 or use MMX at all, unless you use MMX intrinsics. Definitely not when targeting 64-bit code.

    Also mov to/from control regs (0f 21/23 /r) or debug registers (0f 20/22 /r) are both the mov mnemonic, but gcc will definitely never emit either on its own. Only available with GP register operands as the operand that isn't the debug or control register. So that's technically the answer to your title question, but probably not what you actually want.


    GCC doesn't parse its inline asm template string, it just includes it in its asm text output to feed to the assembler after substituting for %number operands. So GCC itself is not an obstacle to emitting arbitrary asm text using inline asm.

    And you can use .byte to emit arbitrary machine code.

    Perhaps a good option would be to use a 0E byte as a prefix for your special mov encoding that you're going to make GEM decode specially. 0E is push CS in 32-bit mode, invalid in 64-bit mode. GCC will never emit either.

    Or just an F2 repne prefix; GCC will never emit repne in front of a mov opcode (where it doesn't apply), only movs. (F3 rep / repe means xrelease when used on a memory-destination instruction so don't use that. https://www.felixcloutier.com/x86/xacquire:xrelease says that F2 repne is the xacquire prefix when used with locked instructions, which doesn't include mov to memory so it will be silently ignored there.)

    As usual, prefixes that don't apply have no documented behaviour, but in practice CPUs that don't understand a rep / repne ignore it. Some future CPU might understand it to mean something special, and that's exactly what you're doing with GEM.

    Picking .byte 0x0e; instead of repne; might be a better choice if you want to guard against accidentally leaving these prefixes in a build you run on a real CPU. (It will #UD -> SIGILL in 64-bit mode, or usually crash from messing up the stack in 32-bit mode.) But if you do want to be able to run the exact same binary on a real CPU, with the same code alignment and everything, then an ignored REP prefix is ideal.


    Using a prefix in front of a standard mov instruction has the advantage of letting the assembler encode the operands for you:

    template<class T>
    void fancymov(T& dst, T src) {
        // fixme: imm -> mem  needs a size suffix, defeating template
        // unless you use Intel-syntax where the operand includes "dword ptr"
        asm("repne; movl  %1, %0"
    #if 1
           : "=m"(dst)
           : "ri" (src)
    #else
           : "=g,r"(dst)
           : "ri,rmi" (src)
    #endif
           : // no clobbers
        );
    }
    
    void test(int *dst, long src) {
        fancymov(*dst, (int)src);
        fancymov(dst[1], 123);
    }
    

    (Multi-alternative constraints let the compiler pick either reg/mem destination or reg/mem source. In practice it prefers the register destination even when that will cost it another instruction to do its own store, so that sucks.)

    On the Godbolt compiler explorer, for the version that only allows a memory-destination:

    test(int*, long):
            repne; movl  %esi, (%rdi)       # F2 E9 37
            repne; movl  $123, 4(%rdi)      # F2 C7 47 04 7B 00 00 00
            ret
    

    If you wanted this to be usable for loads, I think you'd have to make 2 separate versions of the function and use the load version or store version manually, where appropriate, because GCC seems to want to use reg,reg whenever it can.


    Or with the version allowing register outputs (or another version that returns the result as a T, see the Godbolt link):

    test2(int*, long):
            repne; mov  %esi, %esi
            repne; mov  $123, %eax
            movl    %esi, (%rdi)
            movl    %eax, 4(%rdi)
            ret