Assume we're trying to use the TSC for performance monitoring and we want to prevent instruction reordering.
These are our options:
1: rdtscp is a serializing call. It prevents reordering around the call to rdtscp.
__asm__ __volatile__("rdtscp; " // serializing read of tsc
"shl $32,%%rdx; " // shift higher 32 bits stored in rdx up
"or %%rdx,%%rax" // and or onto rax
: "=a"(tsc) // output to tsc variable
:
: "%rcx", "%rdx"); // rcx and rdx are clobbered
However, rdtscp is only available on newer CPUs. So in this case we have to use rdtsc. But rdtsc is non-serializing, so using it alone will not prevent the CPU from reordering it.
So we can use either of these two options to prevent reordering:
2: This is a call to cpuid and then rdtsc. cpuid is a serializing call.
volatile int dont_remove __attribute__((unused)); // volatile to stop optimizing
unsigned tmp;
__cpuid(0, tmp, tmp, tmp, tmp); // cpuid is a serialising call
dont_remove = tmp; // prevent optimizing out cpuid
__asm__ __volatile__("rdtsc; " // read of tsc
"shl $32,%%rdx; " // shift higher 32 bits stored in rdx up
"or %%rdx,%%rax" // and or onto rax
: "=a"(tsc) // output to tsc
:
: "%rdx"); // only rdx is clobbered (rdtsc writes edx:eax, not rcx)
3: This is a call to rdtsc with memory in the clobber list, which prevents reordering.
__asm__ __volatile__("rdtsc; " // read of tsc
"shl $32,%%rdx; " // shift higher 32 bits stored in rdx up
"or %%rdx,%%rax" // and or onto rax
: "=a"(tsc) // output to tsc
:
: "%rdx", "memory"); // only rdx is clobbered (rdtsc does not write rcx)
// memory to prevent reordering
My understanding for the 3rd option is as follows:
Making the call __volatile__ prevents the optimizer from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm. However, it could still move it with respect to unrelated operations. So __volatile__ is not enough.
Tell the compiler memory is being clobbered: : "memory"). The "memory" clobber means that GCC cannot make any assumptions about memory contents remaining the same across the asm, and thus will not reorder around it.
So my questions are:
Is my understanding of __volatile__ and "memory" correct?
Using "memory" looks much simpler than using another serializing instruction. Why would anyone use the 3rd option over the 2nd option?

As mentioned in a comment, there's a difference between a compiler barrier and a processor barrier. volatile and memory in the asm statement act as a compiler barrier, but the processor is still free to reorder instructions.
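To make the distinction concrete, here is a minimal sketch of a compiler-only barrier, assuming GCC or Clang extended asm. The names compiler_barrier, payload, ready and publish are hypothetical and used only for illustration. The empty asm statement emits no machine instruction at all; it only stops the compiler from moving memory accesses across it.

```c
/* Compiler-only barrier: an empty asm with a "memory" clobber.
 * It generates no instruction; it only constrains the optimizer.
 * Assumes GCC/Clang extended asm. All names here are hypothetical. */
static inline void compiler_barrier(void)
{
    __asm__ __volatile__("" ::: "memory");
}

static int payload;
static int ready;

/* The compiler must emit the store to payload before the store to ready,
 * because it may not move memory accesses across the barrier.
 * Nothing here tells the processor anything: no fence instruction is emitted. */
int publish(int value)
{
    payload = value;
    compiler_barrier();
    ready = 1;
    return payload;
}
```

Compiling this with -O2 -S should show two plain stores and no fence instruction between them, which is why a "memory" clobber alone says nothing about what the CPU does with rdtsc at run time.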
Processor barriers are special instructions that must be explicitly given, e.g. rdtscp, cpuid, memory fence instructions (mfence, lfence, ...) etc. lfence is also an execution barrier (on Intel, and more recently AMD), so it's interesting in combination with rdtsc (which isn't a memory operation, and is only ordered by *fence instructions if something in a manual says so). Fun fact: x86's strongly-ordered memory model makes lfence basically useless for memory ordering, leaving execution ordering as its main use-case.
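As a sketch of that combination, in the same inline-asm style as the question: the helper below issues lfence immediately before rdtsc. It assumes GCC or Clang on x86 or x86-64, and a CPU on which lfence acts as an execution barrier (which on AMD may depend on the OS having enabled that behaviour). The name rdtsc_lfence is hypothetical.

```c
#include <stdint.h>

/* Read the TSC after all earlier instructions have finished executing.
 * lfence is the processor barrier; "memory" is only the compiler barrier.
 * Hypothetical helper name; assumes x86/x86-64 with GCC or Clang. */
static inline uint64_t rdtsc_lfence(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\t"   /* wait for prior instructions to execute */
                         "rdtsc"        /* edx:eax = time-stamp counter */
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");
    return ((uint64_t)hi << 32) | lo;   /* combine the two 32-bit halves */
}
```

Letting the compiler do the shift and or (instead of doing it inside the asm) means the same code works on 32-bit x86 and needs no extra clobbers.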
As an aside, while using cpuid as a barrier before rdtsc is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the cpuid instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it's better to use a cheaper execution fence instruction like lfence, or serialize on very recent CPUs. serialize is also a memory barrier and fully serializes the pipeline like cpuid, but without a vmexit, so putting it before rdtsc would wait for stores to commit as well, unlike lfence, which just waits for instructions to finish executing.
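One way to act on this, sketched under assumptions: detect serialize once at startup and fall back to lfence otherwise. The sketch assumes GCC or Clang with <cpuid.h> providing __get_cpuid_count, and relies on SERIALIZE being reported in CPUID leaf 7, subleaf 0, EDX bit 14. The helper names has_serialize and fence_before_rdtsc are hypothetical. Note that the detection itself executes cpuid, so it should run once rather than on every measurement.

```c
#include <cpuid.h>   /* __get_cpuid_count (GCC/Clang, x86 only) */

/* Returns 1 if the CPU reports the SERIALIZE instruction, else 0.
 * Feature bit: CPUID.(EAX=7,ECX=0):EDX[14]. Call once and cache the result,
 * since cpuid itself may trap in a virtual machine. */
static int has_serialize(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                      /* leaf 7 not supported */
    return (int)((edx >> 14) & 1u);
}

/* Hypothetical helper: fence to place before rdtsc.
 * use_serialize should be the cached result of has_serialize(). */
static inline void fence_before_rdtsc(int use_serialize)
{
    if (use_serialize)
        /* SERIALIZE is encoded as 0F 01 E8; raw bytes avoid needing a recent assembler. */
        __asm__ __volatile__(".byte 0x0f, 0x01, 0xe8" ::: "memory");
    else
        /* Execution fence only: does not wait for stores to commit. */
        __asm__ __volatile__("lfence" ::: "memory");
}
```

A caller would store has_serialize() in a variable at program start and pass that value to fence_before_rdtsc before each rdtsc.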
The Linux kernel used to use mfence;rdtsc on AMD platforms and lfence;rdtsc on Intel. As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See the commit "x86: Remove X86_FEATURE_MFENCE_RDTSC": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f
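For user-space code, the same lfence-then-rdtsc pairing can be written with compiler intrinsics and used to time a region. This is only a sketch, not the kernel's implementation: it assumes GCC or Clang on x86-64 with <x86intrin.h>, and the names tsc_now, tsc_ticks and example_work are hypothetical. The trailing lfence is an extra precaution so that the timed code does not start before the counter has been read.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Fenced TSC read. Hypothetical helper, assuming lfence is an
 * execution barrier on the CPU in use. */
static inline uint64_t tsc_now(void)
{
    _mm_lfence();             /* earlier instructions finish before the read */
    uint64_t t = __rdtsc();   /* 64-bit time-stamp counter */
    _mm_lfence();             /* later instructions wait until the read is done */
    return t;
}

/* Returns the number of TSC ticks (not necessarily core cycles) spent in fn. */
uint64_t tsc_ticks(void (*fn)(void))
{
    uint64_t start = tsc_now();
    fn();
    uint64_t end = tsc_now();
    return end - start;
}

/* Hypothetical workload: the volatile keeps the loop from being optimized away. */
void example_work(void)
{
    volatile unsigned x = 0;
    for (unsigned i = 0; i < 1000; i++)
        x += i;
}
```

Calling tsc_ticks(example_work) several times and taking the minimum is a common way to reduce noise from interrupts and cache misses.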