I'm trying to understand the implementation of context switching in the Linux kernel (specifically in x86) and to that end I have a couple of questions.
Why is the switch_to() macro defined and invoked as shown below?
Is the value of prev at location (1) in the comments below the same as at location (2), or does the stack switch performed as part of switch_to() change it? If prev and next are stored in registers rsi and rdx, which don't appear to be saved in __switch_to_asm(), I would assume prev is the same at locations (1) and (2), but given the definition of the macro perhaps that's not the case?
#define switch_to(prev, next, last)                 \
do {                                                \
    ((last) = __switch_to_asm((prev), (next)));     \
} while (0)
From Linux kernel v5.8.6:
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)
{
    prepare_task_switch(rq, prev, next);

    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);

    /*
     * kernel -> kernel   lazy + transfer active
     *   user -> kernel   lazy + mmgrab() active
     *
     * kernel ->   user   switch + mmdrop() active
     *   user ->   user   switch
     */
    if (!next->mm) {                                // to kernel
        enter_lazy_tlb(prev->active_mm, next);

        next->active_mm = prev->active_mm;
        if (prev->mm)                               // from user
            mmgrab(prev->active_mm);
        else
            prev->active_mm = NULL;
    } else {                                        // to user
        membarrier_switch_mm(rq, prev->active_mm, next->mm);
        /*
         * sys_membarrier() requires an smp_mb() between setting
         * rq->curr / membarrier_switch_mm() and returning to userspace.
         *
         * The below provides this either through switch_mm(), or in
         * case 'prev->active_mm == next->mm' through
         * finish_task_switch()'s mmdrop().
         */
        switch_mm_irqs_off(prev->active_mm, next->mm, next);

        if (!prev->mm) {                            // from kernel
            /* will mmdrop() in finish_task_switch(). */
            rq->prev_mm = prev->active_mm;
            prev->active_mm = NULL;
        }
    }

    rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

    prepare_lock_switch(rq, next, rf);

    // LOCATION (1)
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);
    // LOCATION (2)
    barrier();

    return finish_task_switch(prev);
}
Why is the switch_to() macro defined and invoked as shown?
I can only assume this functionality is implemented as a macro because it is arch-dependent, and some architectures may need more than the simple function call that x86 uses. This is a common pattern for arch-specific code. You can compare the different implementations of the macro across the kernel's per-architecture code.
Is the value of prev at location (1) in the comments the same as at location (2), or does the stack switch as part of switch_to() change it?
The stack switch does not really matter here. The macro neither reads prev from the stack nor saves it there. The new value of prev is simply the return value of __switch_to_asm(), which is in turn the return value of __switch_to() (the former ends with a tail call to the latter). Since __switch_to() returns the prev it was passed as is (see arch/x86/kernel/process_64.c for x86-64), the value that comes back out of switch_to() is the same prev that went in.
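To make the data flow concrete, here is a minimal user-space sketch of the macro. It is not kernel code: struct task, stub_switch_to_asm() and prev_survives_switch() are hypothetical names, and the stub only models the one property discussed above (the callee hands prev back unchanged), without touching stacks or registers.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct; only an id is needed here. */
struct task {
    int pid;
};

/*
 * Stub (assumption): models only the fact that the switch routine returns
 * the prev pointer it was given. The real __switch_to_asm() also swaps
 * the stack and the callee-saved registers.
 */
static struct task *stub_switch_to_asm(struct task *prev, struct task *next)
{
    (void)next;
    return prev;
}

/* Same shape as the kernel macro, but calling the stub above. */
#define switch_to(prev, next, last)                     \
do {                                                    \
    ((last) = stub_switch_to_asm((prev), (next)));      \
} while (0)

/* Returns 1 if prev holds the same pointer before and after the macro. */
static int prev_survives_switch(struct task *prev, struct task *next)
{
    struct task *before = prev;     /* corresponds to location (1) */

    switch_to(prev, next, prev);

    return prev == before;          /* corresponds to location (2) */
}
```

Calling prev_survives_switch() with any two tasks returns 1 in this model, because `last` is assigned whatever the callee returns and the callee returns its first argument.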
This makes sense: after switch_to(), prev should still identify the task that was just switched away from. If you schedule with prev=A next=B, and B was previously switched away with prev=B next=C, you want B to resume with prev=A, not prev=B!
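The A/B/C scenario can be replayed with a toy model. This is only a sketch under loud assumptions: struct task, its frame_prev field (standing in for the local prev held in a task's suspended context_switch() frame) and both functions are hypothetical, and no real stack switching happens.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct. */
struct task {
    const char *name;
    /*
     * Models the local `prev` in this task's suspended context_switch()
     * frame. NULL until the task has taken part in a switch.
     */
    struct task *frame_prev;
};

/*
 * Toy model of switch_to(prev, next, prev): the outgoing task's frame holds
 * prev == itself at location (1); the incoming task's frame receives the
 * returned prev at location (2). Returns the task that is now running.
 */
static struct task *sim_context_switch(struct task *prev, struct task *next)
{
    prev->frame_prev = prev;    /* location (1), outgoing task's frame */
    next->frame_prev = prev;    /* location (2), incoming task's frame */
    return next;
}

/* Replays B -> C, C -> A, A -> B and reports the prev that B sees on resume. */
static const char *prev_seen_by_b_on_resume(void)
{
    struct task a = { "A", NULL };
    struct task b = { "B", NULL };
    struct task c = { "C", NULL };
    struct task *cur = &b;

    cur = sim_context_switch(cur, &c);  /* B switched away: prev=B next=C */
    cur = sim_context_switch(cur, &a);  /* C switched away: prev=C next=A */
    cur = sim_context_switch(cur, &b);  /* A switched away: prev=A next=B */

    return cur->frame_prev->name;       /* B's view of prev after resuming */
}
```

In this model B's frame held prev=B when it was switched away, and holds prev=A once it resumes, which is the value finish_task_switch(prev) would need.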