macosx86pthreadsx86-64detours

Threads spawned by a detoured pthread_create do not execute instructions


I've got a custom implementation of detours on macOS and a test application using it, which is written in C, compiled for macOS x86_64, running on an Intel i9 processor.

The implemention works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one but as soon as I continue it does not progress. There are no mutexes or synchronisation involved and the result of the function is 0 (success). The exact same application with detours turned off works fine so it's unlikely to be the culprit.

This does not happen all the time - sometimes they are fine but at other times the test applications stalls in the following state:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
    frame #2: 0x0000000100001186 DetoursTestApp`main + 262
    frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
    frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
  thread #2
    frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start

Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:

static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    return original(thread, attr, start_routine, arg);
}

Where detour_original retrieves the pointer to [original function + size of function's prologue]. Tracing through the instructions, everything seems to be working correctly and pthread_create terminates successfully. Tracing the application's system calls via dtruss does show calls to

bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000)               = 29646848 0

With what I have confirmed are the correct arguments.

This behaviour is only observed in release builds - debug works fine but the disassembly and execution of a detoured pthread_create and associated detours code seems to be identical in both cases.


Workarounds

I found a couple of odd workarounds for this issue that don't make much sense. Given the detour function, a number of things can be substituted into the following:

static int pthread_create_detour(pthread_t* thread,
                                 const pthread_attr_t* attr,
                                 void* (*start_routine)(void*),
                                 void* arg)
{
    detour_count++;
    pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
    <...> <== SUBSTITUTE HERE
    return original(thread, attr, start_routine, arg);
}
  1. A cache flush.
    __asm__ __volatile__("" ::: "memory");
    _mm_clflush(real_pthread_create);
  1. A sleep of any duration - usleep(1)
  2. A printf statement.
  3. A memory allocation larger than 32768 bytes, e.g. void *data = malloc(40000);.

Cache?

All of these seem to point to a stale instruction cache. However, the Intel manual states the following:

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.

What's even more interesting is that those workarounds have to be executed for every new thread created, with the execution happening on the main thread, so it's very unlikely to be the cache. I have also tried putting in cache flushes at every memory write that writes instructions but that did not help. I've also written a memcpy that bypasses the cache with the use of Intel's intrinsic _mm_stream_si32 and swapped it out for every instruction memory write in my implementation without any success.


Race condition?

The next suspect in line is a race condition. However, it's not clear what would be racing as at first there are no other threads. I have put in a fibonacci sequence calculation for a randomly-generated number and that would still stall the newly-spawned threads.


The question

What is causing this issue? What other mechanisms could be responsible for this?

At this point I have run out of things to check so any suggestions will be welcome.


Solution

  • I found that the reason why the spawned thread was not executing instructions was that the r8 register wasn't being cleared at the right time in the execution of pthread_create due to an issue with my detours implementation.

    If we look at the disassembly of the function, it is split up to two parts - the "head" and the "body" that's found in an internal _pthread_create function. The head does two things - zeroes out r8 and jumps to the body:

    libsystem_pthread.dylib`pthread_create:
        0x7fff72a2e236 <+0>: 45 31 c0        xor    r8d, r8d
        0x7fff72a2e239 <+3>: e9 40 37 00 00  jmp    0x7fff72a3197e            ; _pthread_create
    
    libsystem_pthread.dylib`_pthread_create:
        0x7fff72a3197e <+0>:    55                                push   rbp
        0x7fff72a3197f <+1>:    48 89 e5                          mov    rbp, rsp
        0x7fff72a31982 <+4>:    41 57                             push   r15
        <...> // the rest of the 1409 instructions
    

    My implementation would detour the internal _pthread_create function instead of the head containing the actual entry point which meant that the r8 would get cleared at the wrong time (before the detour). Since the detour function would contain some could, the execution would go something like:

    pthread_create (r8 gets cleared) -> _pthread_create -> chain of jumps -> pthread_create_detour -> trampoline (containing the beginning of _pthread_create) -> _pthread_create + 6

    Which meant that depending on the contents of the pthread_create_detour function the r8 would not always end up with a 0 when it returned to the internal function.

    It's not yet clear why having r8 set to something other than 0 before _pthread_create would not crash but instead start up a thread in a locked up state. An important detail is that the stalled thread would have the rflags register set to 0x200 which should never be the case according to Intel's manual. This is what lead me to inspecting the CPU state more closely, leading to the answer.