assembly x86 cpu-architecture memory-segmentation cpu-cycles

Is a mov to a segmentation register slower than a mov to a general purpose register?


Specifically is:

mov %eax, %ds

Slower than

mov %eax, %ebx

Or are they the same speed? I've researched online, but have been unable to find a definitive answer.

I'm not sure if this is a silly question, but I think it's conceivable that modifying a segment register could make the processor do extra work.

N.B. I'm concerned with old x86 Linux CPUs, not modern x86-64 CPUs, where segmentation works differently.


Solution

  • mov %eax, %ebx between general-purpose registers is one of the most common instructions. Modern hardware supports it extremely efficiently, often with special cases that don't apply to any other instruction. On older hardware, it's always been one of the cheapest instructions.

    On Ivybridge and later, it doesn't even need an execution unit and has zero latency; it's handled in the register-rename stage. See Can x86's MOV really be "free"? Why can't I reproduce this at all? Even on earlier CPUs, it's 1 uop for any ALU port (so typically 3 or 4 per clock throughput).
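
    As a rough way to see this for yourself, time a loop whose only loop-carried dependency (besides the counter) runs through a pair of register-to-register movs. This is just a minimal sketch (GNU as / AT&T syntax; the label and iteration count are my own): with mov-elimination the movs add zero latency and the loop runs at about 1 cycle per iteration; without it, the loop is limited by the ~2-cycle mov chain instead.

        mov     $100000000, %ecx        # iteration count (arbitrary)
        1:
        mov     %eax, %ebx              # eliminated at rename on IvB and later
        mov     %ebx, %eax              # loop-carried chain back into %eax
        dec     %ecx
        jnz     1b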

    On AMD Piledriver / Steamroller, mov r32,r32 and r64,r64 can run on AGU ports as well as ALU ports, giving it 4 per clock throughput vs. 2 per clock for add, or for mov on 8 or 16-bit registers (which have to merge into the destination).


    mov to a segment reg is a fairly rare instruction in typical 32 and 64-bit code. It is part of what kernels do for every system call (and probably interrupts), though, so making it efficient will speed up the fast path for system-call and I/O intensive workloads. So even though it appears in only a few places, it can run a fair amount. But it's still of minor importance compared to mov r,r!

    mov to a segment reg is slow: it triggers a load from the GDT or LDT to update the descriptor cache, so it's microcoded.

    This is the case even in x86-64 long mode; the segment base/limit fields in the GDT entry are ignored, but it still has to update the descriptor cache with other fields from the segment descriptor, including the DPL (descriptor privilege level) which does apply to data segments.
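
    For a sense of what that costs, you can exercise it safely from user space: read the current %ds selector once and write the same value back in a loop. A minimal sketch (GNU as / AT&T syntax; the label and count are my own). Rewriting a valid selector you already hold doesn't fault, but each write still goes through the microcoded descriptor reload, which is what makes it measurable.

        mov     %ds, %eax               # current data-segment selector
        mov     $10000000, %ecx         # iteration count (arbitrary)
        1:
        mov     %eax, %ds               # microcoded: reloads the descriptor cache from the GDT/LDT
        dec     %ecx
        jnz     1b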


    Agner Fog's instruction tables list uop counts and throughput for mov sr, r (Intel syntax, mov to segment reg) for Nehalem and earlier CPUs. He stopped testing seg regs for later CPUs because it's obscure and not used by compilers (or humans optimizing by hand), but the counts for SnB-family are probably somewhat similar. (InstLatx64 doesn't test seg regs either, e.g. not in this Sandybridge instruction-timing test.)

    MOV sr,r on Nehalem (presumably tested in protected mode or long mode):

    Other CPUs are similar:

    Weird Al was right, It's All About the Pentiums

    In-order Pentium (P5 / PMMX) had cheaper mov-to-sr: Agner lists it as taking ">= 2 cycles", and non-pairable. (P5 was in-order 2-wide superscalar with some pairing rules on which instructions could execute together). That seems cheap for protected mode, so maybe the 2 is in real mode and protected mode is the greater-than? We know from his P4 table notes that he did test stuff in 16-bit mode back then.


    Agner Fog's microarch guide says that Core2 / Nehalem can rename segment registers (Section 8.7 Register renaming):

    All integer, floating point, MMX, XMM, flags and segment registers can be renamed. The floating point control word can also be renamed.

    (Pentium M could not rename the FP control word, so changing the rounding mode blocks OoO exec of FP instructions: all earlier FP instructions have to finish before it can modify the control word, and later ones can't start until after. I guess segment regs would be the same, but for load and store uops.)
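
    For concreteness, changing the rounding mode means rewriting the RC field (bits 11:10) of the x87 control word. A minimal sketch in GNU as / AT&T syntax, with my own symbol names:

        .lcomm  saved_cw, 2             # storage for the 16-bit x87 control word
        .lcomm  new_cw, 2

        fnstcw  saved_cw                # save the current control word
        movzwl  saved_cw, %eax
        orl     $0x0C00, %eax           # RC = 11b: round toward zero
        movw    %ax, new_cw
        fldcw   new_cw                  # on CPUs that can't rename the control word, this waits
                                        # for all earlier x87 instructions to complete first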

    He says that Sandybridge can "probably" rename segment regs, and Haswell/Broadwell/Skylake can "perhaps" rename them. My quick testing on SKL shows that writing the same segment reg repeatedly is slower than writing different segment regs, which indicates that they're not fully renamed. It seems like an obvious thing to drop support for, because they're very rarely modified in normal 32 / 64-bit code.

    And each seg reg is usually only modified once at a time, so multiple dep chains in flight for the same segment register are not very useful. (i.e. you won't see WAW hazards for segment regs in Linux, and WAR is barely relevant because the kernel won't use user-space's DS for any memory references in a kernel entry-point. I think interrupts are serializing, but entering the kernel via syscall could maybe still have a user-space load or store in flight but not executed yet.)
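
    The kind of comparison behind that SKL observation looks roughly like this (a sketch in GNU as / AT&T syntax, with my own labels and counts, not the exact test code): time each loop with perf, and if variant A is slower per write than variant B, back-to-back writes to one seg reg serialize on something, i.e. the register isn't being fully renamed.

        mov     %ds, %eax               # a valid selector to write back
        mov     $10000000, %ecx
        1:                              # variant A: same seg reg every time
        mov     %eax, %ds
        mov     %eax, %ds
        dec     %ecx
        jnz     1b

        mov     $10000000, %ecx
        2:                              # variant B: two different seg regs
        mov     %eax, %ds
        mov     %eax, %es
        dec     %ecx
        jnz     2b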

    In chapter 2, which explains out-of-order exec in general (all CPUs except P1 / PMMX), section 2.2 Register renaming says that "possibly segment registers can be renamed", but IDK if he means that some CPUs do and some don't, or if he's not sure about some old CPUs. He doesn't mention seg reg renaming in the PII/PIII or Pentium-M sections, so I can't tell you about the old 32-bit-only CPUs you're apparently asking about. (And he doesn't have a microarch guide section for AMD before K8.)

    You could benchmark it yourself if you're curious, with performance counters. (See Are loads and stores the only instructions that gets reordered? for an example of how to test for blocking out-of-order execution, and Can x86's MOV really be "free"? Why can't I reproduce this at all? for basics on using perf on Linux to do microbenchmarks on tiny loops.)
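
    A tiny static executable is all it takes; something like this sketch (GNU as / AT&T syntax for x86-64 Linux; the file name, perf event list, and loop body are my own choices), run under perf stat:

        # seg-bench.s:  as seg-bench.s -o seg-bench.o && ld seg-bench.o -o seg-bench
        # then:         perf stat -e cycles,instructions,uops_issued.any ./seg-bench
        .globl  _start
        .text
        _start:
        mov     %ds, %eax               # selector to rewrite
        mov     $100000000, %ecx        # enough iterations to swamp startup noise
        1:
        mov     %eax, %ds               # instruction under test; swap in whatever you're measuring
        dec     %ecx
        jnz     1b

        mov     $60, %eax               # __NR_exit (x86-64)
        xor     %edi, %edi
        syscall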


    Reading a segment reg

    mov from a segment reg is relatively cheap: it only modifies a GP register, and CPUs are good at writes to GP registers, with register renaming etc. Agner Fog found it was a single uop on Nehalem. Fun fact: on Core2 / Nehalem it runs on the load port, so I guess that's where segment regs are stored on that microarchitecture.

    (Except on P4: apparently reading seg regs was expensive there.)

    A quick test on my Skylake (in long mode) shows that mov eax, fs (or cs or ds or whatever) is 2 uops, one of which only runs on port 1, while the other can run on any of p0156 (i.e. on the ALU ports). It has a throughput of 1 per clock, bottlenecked on port 1.
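
    A throughput test for the read side can be as simple as back-to-back independent reads, roughly like this sketch (GNU as / AT&T syntax; register choices and count are my own). Per the numbers above, SKL should sustain about one mov-from-sr per cycle, limited by port 1.

        mov     $25000000, %ecx
        1:
        mov     %fs, %eax               # independent: the GP destinations are renamed,
        mov     %fs, %edx               # so there's no loop-carried dependency through them
        mov     %fs, %esi
        mov     %fs, %edi
        dec     %ecx
        jnz     1b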

    Not tested: interleaving with memory instructions, including cache-miss loads, to see if multiple dep chains could be in flight. So I really only tested throughput. If there's a throughput bottleneck other than the WAW hazard itself, that doesn't rule out tracking segment registers along with loads/stores. But that seems unlikely to be worth it for modern code: segment regs typically only change right before or after a privilege-level change that drains the out-of-order back end anyway, not mixed in with various loads/stores. Except maybe changing FS or GS base on context switches.


    You normally only mess with FS or GS for thread-local storage, and you don't do it with mov to FS; you make a system call that has the OS use an MSR or wrfsbase to modify the segment base that's cached inside the CPU. (Or if the OS allows and the CPU supports it, you could use wrfsbase in user-space.)
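
    On x86-64 Linux that system call is arch_prctl. A minimal sketch (GNU as / AT&T syntax; the buffer name is my own, and GS is used here because glibc owns FS for TLS):

        .lcomm  my_tls_block, 64        # hypothetical per-thread data

        mov     $158, %eax              # __NR_arch_prctl (x86-64)
        mov     $0x1001, %edi           # ARCH_SET_GS
        lea     my_tls_block(%rip), %rsi    # new GS base address
        syscall
        # afterwards %gs:offset addressing reaches my_tls_block, with no mov to a seg reg involved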


    N.B. I'm concerned with old x86 Linux CPUs, not modern x86-64 CPUs, where segmentation works differently.

    You said "Linux", so I assume you mean protected mode, not real mode (where segmentation works completely differently). Probably mov sr, r decodes differently in real mode, but I don't have a test setup where I can profile with performance counters for real or VM86 mode running natively.

    FS and GS in long mode work basically the same as in protected mode; it's the other seg regs that are "neutered" in long mode. I think Agner Fog's Core2 / Nehalem numbers are probably similar to what you'd see on a PIII in protected mode: they're part of the same microarchitecture family. I don't think we have a useful number for P5 Pentium segment register writes in protected mode.

    (Sandybridge was the first of a new family derived from P6-family, with significant internal changes and some ideas from P4 implemented in a different (better) way, e.g. SnB's decoded-uop cache is not a trace cache. But more importantly, SnB uses a physical register file instead of keeping values right in the ROB, so its register renaming machinery is different.)