
Interpretation of prev task within context_switch() in Linux kernel scheduler code


I'm trying to understand the implementation of context switching in the Linux kernel (specifically in x86) and to that end I have a couple of questions.

  1. Why is the switch_to() macro defined and invoked as shown below?

  2. Is the value of prev at location (1) in the comments below the same as at location (2), or does the stack switch performed by switch_to() change it? If prev and next are passed in registers rdi and rsi, which don't appear to be saved in __switch_to_asm(), I would assume prev is the same at locations (1) and (2), but given the definition of the macro perhaps that's not the case?

#define switch_to(prev, next, last)                 \
do {                                    \
    ((last) = __switch_to_asm((prev), (next)));         \
} while (0)

From Linux kernel v5.8.6:

static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
           struct task_struct *next, struct rq_flags *rf)
{
    prepare_task_switch(rq, prev, next);

    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);

    /*
     * kernel -> kernel   lazy + transfer active
     *   user -> kernel   lazy + mmgrab() active
     *
     * kernel ->   user   switch + mmdrop() active
     *   user ->   user   switch
     */
    if (!next->mm) {                                // to kernel
        enter_lazy_tlb(prev->active_mm, next);

        next->active_mm = prev->active_mm;
        if (prev->mm)                           // from user
            mmgrab(prev->active_mm);
        else
            prev->active_mm = NULL;
    } else {                                        // to user
        membarrier_switch_mm(rq, prev->active_mm, next->mm);
        /*
         * sys_membarrier() requires an smp_mb() between setting
         * rq->curr / membarrier_switch_mm() and returning to userspace.
         *
         * The below provides this either through switch_mm(), or in
         * case 'prev->active_mm == next->mm' through
         * finish_task_switch()'s mmdrop().
         */
        switch_mm_irqs_off(prev->active_mm, next->mm, next);

        if (!prev->mm) {                        // from kernel
            /* will mmdrop() in finish_task_switch(). */
            rq->prev_mm = prev->active_mm;
            prev->active_mm = NULL;
        }
    }

    rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

    prepare_lock_switch(rq, next, rf);

// LOCATION (1) 

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);

// LOCATION (2) 

    barrier();

    return finish_task_switch(prev);
}

Solution

  • Why is the switch_to() macro defined and invoked as shown in the question?

    I can only assume this functionality is implemented as a macro because it is architecture-dependent, and an architecture might need more than the simple function call that x86 uses. Being a macro also lets it assign the result directly to its last argument. This is a common pattern for arch-specific code: each architecture supplies its own definition of the macro (for x86 it lives in arch/x86/include/asm/switch_to.h).
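The mechanics of the macro can be tried out in user space. The following is a minimal sketch, not kernel code: struct task, fake_switch_to_asm() and demo_switch() are hypothetical names invented for illustration, and no real context switch happens. It only shows that the macro writes its result into the third argument, and that the do { } while (0) wrapper makes it behave as a single statement.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct. */
struct task {
    const char *name;
};

/*
 * Hypothetical stand-in for __switch_to_asm(). It performs no switch at all;
 * it only hands back prev, as x86's __switch_to() does with its argument.
 */
static struct task *fake_switch_to_asm(struct task *prev, struct task *next)
{
    (void)next;
    return prev;
}

/* Same shape as the x86 macro quoted in the question. */
#define switch_to(prev, next, last)                         \
do {                                                        \
    ((last) = fake_switch_to_asm((prev), (next)));          \
} while (0)

/* Returns whatever ends up in `last` after one simulated switch. */
static struct task *demo_switch(struct task *prev, struct task *next,
                                int do_switch)
{
    struct task *last = NULL;

    /*
     * Because of do { } while (0), the macro is one statement and is
     * safe in an unbraced if/else.
     */
    if (do_switch)
        switch_to(prev, next, last); /* last = fake_switch_to_asm(prev, next); */
    else
        last = NULL;

    return last;
}
```

A plain function could only produce the same effect by taking a pointer to last; the macro form lets callers write switch_to(prev, next, prev) and have prev overwritten in place.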

  • Is the value of prev in location (1) in the comment the same as it is in location (2), or does the stack switch as part of switch_to() change it?

    The stack switch does not really matter here. The value of prev is neither taken from nor saved to the stack. The new value of prev is simply the return value of __switch_to_asm(), which in turn is the return value of __switch_to() (the former ends with a tail call to the latter). Since __switch_to() returns the prev it was passed as is (see its definition in arch/x86/kernel/process_64.c), the prev handed to switch_to() by the outgoing task comes out of the switch unchanged, in the context of the task that resumes.

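    This makes sense: you don't want the task that resumes to see a stale prev left over from its own earlier switch. If you schedule with prev=A next=B, and B was previously switched away with prev=B next=C, you want B to resume with prev=A (the task that just stopped running), not prev=B. So within a single task, prev at location (2) can differ from prev at location (1): it always names the task that was running immediately before the switch, which is what finish_task_switch(prev) needs.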
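The A/B/C scenario can be reproduced in user space. The sketch below is only an illustration under stated assumptions: it uses the ucontext API (getcontext, makecontext, swapcontext) as provided by glibc on Linux rather than anything from the kernel, and struct task, handoff, context_switch_sim() and run_demo() are hypothetical names. The handoff variable plays the role of the return-value register, and new tasks simply start at their entry function, which is a simplification of what the kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

/* Hypothetical stand-in for struct task_struct: name, saved context, own stack. */
struct task {
    char name;
    ucontext_t ctx;
    char stack[64 * 1024];
};

static struct task task_a = { .name = 'A' };
static struct task task_b = { .name = 'B' };
static struct task task_c = { .name = 'C' };

/* Plays the role of the return register; volatile forces a re-read after the switch. */
static struct task *volatile handoff;

/* One record per resume: which task woke up and what it saw in prev. */
struct resume_record { char self; char prev_after; };
static struct resume_record records[8];
static int nrecords;

/* Mirrors switch_to(prev, next, last): last receives whoever switched to us. */
#define switch_to(prev, next, last)                         \
do {                                                        \
    handoff = (prev);                                       \
    int rc_ = swapcontext(&(prev)->ctx, &(next)->ctx);      \
    assert(rc_ == 0);                                       \
    (last) = handoff;                                       \
} while (0)

static void context_switch_sim(struct task *prev, struct task *next)
{
    struct task *self = prev;   /* lives on this task's own stack */

    /* LOCATION (1): prev == self, the task about to stop running. */
    switch_to(prev, next, prev);
    /* LOCATION (2): running again; prev is now the task that switched to us. */

    records[nrecords].self = self->name;
    records[nrecords].prev_after = prev->name;
    nrecords++;
}

static void b_entry(void)
{
    context_switch_sim(&task_b, &task_c);   /* B is switched away with prev=B */
    context_switch_sim(&task_b, &task_a);   /* B never resumes after this */
}

static void c_entry(void)
{
    context_switch_sim(&task_c, &task_a);   /* C never resumes after this */
}

static void init_task(struct task *t, void (*entry)(void))
{
    int rc = getcontext(&t->ctx);
    assert(rc == 0);
    t->ctx.uc_stack.ss_sp = t->stack;
    t->ctx.uc_stack.ss_size = sizeof t->stack;
    t->ctx.uc_link = &task_a.ctx;           /* fallback if an entry function returns */
    makecontext(&t->ctx, entry, 0);
}

/* Task A is the calling context. Returns the number of resumes recorded. */
static int run_demo(void)
{
    nrecords = 0;
    init_task(&task_b, b_entry);
    init_task(&task_c, c_entry);

    context_switch_sim(&task_a, &task_b);   /* A -> B -> C -> back to A */
    context_switch_sim(&task_a, &task_b);   /* A -> B (resumes) -> back to A */
    return nrecords;
}
```

After run_demo() returns, the second record is the case described above: B went into the switch with prev=B and came out of it with prev=A. The first and third records show the same effect for A, which resumes with prev=C and then prev=B.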