While I understand the issues of data races and memory barriers in multithreaded code, it made me wonder about the kernel. If the kernel reschedules a single-threaded process to a different core, the same issues could occur, so memory barriers have to be involved. After doing some research, I learned that kernels may prefer scheduling a task back onto the same core for performance reasons (data still in cache, for example), but this is not guaranteed. So: does the kernel issue a memory barrier on every context switch, or only when a thread is being migrated to a different core?
As far as the Linux kernel scheduler is concerned (v6.14), migration_cpu_stop running on the source CPU calls move_queued_task, which grabs the runqueue lock of the destination CPU.
Releasing this lock pairs with the scheduler's acquisition of the runqueue lock on the destination CPU. This release-acquire pairing acts as a semi-permeable memory ordering: the task's prior memory accesses on the source CPU are ordered before its subsequent memory accesses on the destination CPU.
Note that in addition to the migration case, the membarrier system call has even stricter requirements on memory ordering, and requires memory barriers near the beginning and end of scheduling. Those can be found as smp_mb__after_spinlock() early in __schedule(), and within mmdrop_lazy_tlb_sched() called from finish_task_switch().