I have read a number of articles and S/O answers saying that (on Linux x86_64) FS (or GS in some variants) references a thread-specific page table entry, which in turn gives an array of pointers to the actual data, which lives in shareable memory. When threads are swapped, all the registers are switched over, so the thread's base page changes with them. Thread-local variables are accessed by name with just one extra pointer hop, and the referenced values can be shared with other threads. All good and plausible.
Indeed, if you look at the code for __errno_location(void), the function behind errno, you find something like this (this is from musl, but glibc is not much different):
static inline struct pthread *__pthread_self()
{
struct pthread *self;
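/* per the x86-64 TLS ABI, the first field of the thread control block is a
   pointer to itself, so reading %fs:0 yields the struct pthread* */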
__asm__ __volatile__ ("mov %%fs:0,%0" : "=r" (self) );
return self;
}
And from glibc:
=> 0x7ffff6efb4c0 <__errno_location>: endbr64
0x7ffff6efb4c4 <__errno_location+4>: mov 0x6add(%rip),%rax # 0x7ffff6f01fa8
0x7ffff6efb4cb <__errno_location+11>: add %fs:0x0,%rax
0x7ffff6efb4d4 <__errno_location+20>: retq
So my expectation was that the actual value in FS would change for each thread, i.e. that under the debugger (gdb: info reg or p $fs) I would see a different value of FS in different threads. But no: ds, es, fs and gs are all zero, all the time.
In my own code, I write something like the below and get the same result - FS is unchanged, but the thread-local variable "works":
#include <ostream>

struct Segregs
{
    unsigned short int cs, ss, ds, es, fs, gs;

    friend std::ostream& operator << (std::ostream& str, const Segregs& sr)
    {
        str << "[cs:" << sr.cs << ",ss:" << sr.ss << ",ds:" << sr.ds
            << ",es:" << sr.es << ",fs:" << sr.fs << ",gs:" << sr.gs << "]";
        return str;
    }
};

Segregs GetSegRegs()
{
    unsigned short int r_cs, r_ss, r_ds, r_es, r_fs, r_gs;
    __asm__ __volatile__ ("mov %%cs,%0" : "=r" (r_cs) );
    __asm__ __volatile__ ("mov %%ss,%0" : "=r" (r_ss) );
    __asm__ __volatile__ ("mov %%ds,%0" : "=r" (r_ds) );
    __asm__ __volatile__ ("mov %%es,%0" : "=r" (r_es) );
    __asm__ __volatile__ ("mov %%fs,%0" : "=r" (r_fs) );
    __asm__ __volatile__ ("mov %%gs,%0" : "=r" (r_gs) );
    return {r_cs, r_ss, r_ds, r_es, r_fs, r_gs};
}
But the output?
Main: Seg regs : [cs:51,ss:43,ds:0,es:0,fs:0,gs:0]
Main: tls @0x7ffff699307c=0
Main: static @0x96996c=0
Modified to 1234
Main: tls @0x7ffff699307c=1234
Main: static @0x96996c=1234
Async thread
[New Thread 0x7ffff695e700 (LWP 3335119)]
Thread: Seg regs : [cs:51,ss:43,ds:0,es:0,fs:0,gs:0]
Thread: tls @0x7ffff695e6fc=0
Thread: static @0x96996c=1234
So something else must actually be going on. What extra trickery is happening, and why add the complication?
For context, I'm trying to do something "funky with forks", so I would like to know the gory details.
In 64-bit mode, the actual contents of the 16-bit FS and GS segment registers are normally the "null selector" (0), because other mechanisms are used to set the segment bases with 64-bit values (an MSR, or wrfsbase).
As in protected mode, there are separate "FSBASE" and "GSBASE" registers within the CPU, and when you specify, say, an FS segment override on an instruction, the base address from the FSBASE register is added to the operand's effective address to determine the actual linear address to be accessed.
The kernel's context structure for each thread stores a copy of its FSBASE and GSBASE registers, and they are reloaded appropriately on each context switch.
So what actually happens is that each thread sets its FSBASE register to point to its own thread-local storage. (Depending on the CPU features and OS design, this may only be possible for privileged code, so a system call may be required.) Then instructions with an FS segment override can be used to access an object with a given offset in the thread-local storage block, as you've seen.
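For a concrete feel for that system-call route, here is a minimal sketch (assuming Linux x86-64 and the arch_prctl syscall). It deliberately uses GS, which 64-bit user space on Linux leaves unused, so glibc's TLS behind FSBASE is not disturbed:

#include <asm/prctl.h>     // ARCH_SET_GS, ARCH_GET_FS
#include <sys/syscall.h>   // SYS_arch_prctl
#include <unistd.h>        // syscall()
#include <cstdio>

static unsigned long block[4] = {111, 222, 333, 444};

int main()
{
    // Ask the kernel to load this thread's GSBASE with the address of 'block'.
    syscall(SYS_arch_prctl, ARCH_SET_GS, block);

    // A GS segment override now makes the CPU add GSBASE to the effective
    // address, so %gs:8 reads block[1].
    unsigned long v;
    __asm__ __volatile__ ("mov %%gs:8, %0" : "=r" (v));
    std::printf("%%gs:8 -> %lu\n", v);               // prints 222

    // The FS *selector* stays 0, but FSBASE points at glibc's TCB/TLS block.
    unsigned long fsbase;
    syscall(SYS_arch_prctl, ARCH_GET_FS, &fsbase);
    std::printf("FSBASE = %#lx\n", fsbase);
}

In a real program you never do this by hand for TLS: pthread_create passes the new thread's TLS block to clone with CLONE_SETTLS, and the kernel loads FSBASE from it before the thread starts running.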
In 32-bit mode, on the other hand, the values in FS and GS do have more meaning; they are segment selectors which index into a descriptor table maintained by the kernel. The descriptor table holds the actual segment info, including its base address, and you can use a system call (modify_ldt or set_thread_area on Linux) to ask the kernel to modify it. The kernel arranges for each thread's entry to describe that thread's own storage, so you wouldn't necessarily see different selector values in FS for different threads, but it would still be the case that FS-override instructions from different threads result in accesses to different linear addresses.
(Or a 32-bit kernel could write into a GDT entry and mov a constant from a register into fs or gs to get it to reload that newly-written GDT entry. So it would only need a GDT per logical core instead of an LDT per process. The CPU never reloads a segment descriptor on its own, although with a per-core GDT the entry would still match the current task if you had separate entries for FS and GS. So user-space might not break itself with mov eax,gs / mov gs,eax.)
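For a concrete feel for that 32-bit mechanism, here is a minimal sketch (assumptions: an i386 build with -m32 on Linux, the set_thread_area syscall, error checking omitted). It borrows FS, which 32-bit user space under Linux leaves free (glibc's i386 TLS sits behind %gs there):

#include <asm/ldt.h>        // struct user_desc
#include <sys/syscall.h>    // SYS_set_thread_area
#include <unistd.h>         // syscall()
#include <cstdio>

static unsigned int block[4] = {11, 22, 33, 44};

int main()
{
    // Ask the kernel to install a small data segment based at 'block'
    // into a free GDT TLS slot (entry_number == -1 means "pick one for me").
    struct user_desc desc = {};
    desc.entry_number = (unsigned)-1;
    desc.base_addr    = (unsigned long)block;
    desc.limit        = sizeof(block) - 1;   // byte-granular limit
    desc.seg_32bit    = 1;
    desc.useable      = 1;
    syscall(SYS_set_thread_area, &desc);

    // Build the selector for that GDT entry (RPL 3) and load it into %fs.
    // This mov is the moment the CPU reads the descriptor and latches its base.
    unsigned short sel = (desc.entry_number << 3) | 3;
    __asm__ __volatile__ ("mov %0, %%fs" :: "r" (sel));

    // An FS override now addresses relative to 'block': %fs:4 is block[1].
    unsigned int v;
    __asm__ __volatile__ ("mov %%fs:4, %0" : "=r" (v));
    std::printf("%%fs:4 -> %u\n", v);    // prints 22
}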
Anyway, this really just comes down to the lack of a convenient MSR or wrfsbase way to set segment register bases separately from a mov Sreg, r/m that triggers the CPU to load the descriptor. In either protected or long mode, the segment register value does need to be valid (null = 0 included), and moving some random value into it will likely fault.
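On recent hardware that "no convenient way" gap is closed: with the FSGSBASE extension (enabled for user space since Linux 5.9), ring 3 can read and write the bases directly via rdfsbase/wrfsbase and friends, exposed as compiler intrinsics. A minimal sketch, assuming a CPU and kernel with FSGSBASE and compilation with -mfsgsbase (the instructions fault otherwise):

#include <immintrin.h>   // _readfsbase_u64, _writegsbase_u64 (needs -mfsgsbase)
#include <cstdint>
#include <cstdio>

static unsigned long data[2] = {7, 9};

int main()
{
    // Read the FS base glibc set up for this thread's TLS; the FS selector
    // itself is still the null selector (0).
    unsigned long long fsbase = _readfsbase_u64();
    std::printf("FSBASE = %#llx\n", fsbase);

    // Point GSBASE at our own block, no system call required,
    // then access it through a GS override.
    _writegsbase_u64((unsigned long long)(uintptr_t)data);
    unsigned long v;
    __asm__ __volatile__ ("mov %%gs:8, %0" : "=r" (v));
    std::printf("%%gs:8 -> %lu\n", v);   // prints 9
}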