When linked "properly" (explained below), both function calls below block indefinitely in the pthread calls that implement cv.notify_one and cv.wait_for:
// let's call it odr.cpp, which forms libodr.so
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool ready = false;

void Notify() {
    // Note: this only constructs and discards a duration; it does not sleep.
    std::chrono::milliseconds(100);
    std::unique_lock<std::mutex> lock(mtx);
    ready = true;
    cv.notify_one();
}

void Get() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait_for(lock, std::chrono::milliseconds(300));
}
when the shared library above is used in the following application:
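// let's call it test.cpp, which forms a.out
#include <iostream>
#include <thread>

void Notify();  // defined in libodr.so
void Get();     // defined in libodr.so

int main() {
    std::thread thr([&]() {
        std::cout << "Notify\n";
        Notify();
    });
    std::cout << "Before Get\n";
    Get();
    std::cout << "After Get\n";
    thr.join();
}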
The problem reproduces only when linking libodr.so:
- with g++
- with the gold linker
- with -lpthread as a dependency

with the following versions of the relevant tools:
Linux Mint 18.3 Sylvia
binutils 2.26.1-1ubuntu1~16.04.6
g++ 4:5.3.1-1ubuntu1
libc6:amd64 2.23-0ubuntu10
so that we end up with:
- __pthread_key_create defined as a WEAK symbol of type FUNC
- no libpthread.so as a dependency in the ELF file

as shown here:
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
10: 0000000000000000 0 FUNC WEAK DEFAULT UND __pthread_key_create
On the other hand, with any of the following we experience no bug:
- clang++
- the bfd linker
- no explicit -lpthread
- -lpthread but with -Wl,--no-as-needed

Note: this time we have either:
- NOTYPE and no libpthread.so dependency
- WEAK FUNC and a libpthread.so dependency

as shown here:
$ clang++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
24: 0000000000000000 0 FUNC WEAK DEFAULT UND __pthread_key_create@GLIBC_2.2.5 (7)
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=bfd -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
14: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __pthread_key_create
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
18: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __pthread_key_create
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold -Wl,--no-as-needed -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
10: 0000000000000000 0 FUNC WEAK DEFAULT UND __pthread_key_create@GLIBC_2.2.5 (4)
Complete example to compile/run can be found here: https://github.com/aurzenligl/study/tree/master/cpp-pthread
What breaks a shared library that uses pthread when __pthread_key_create is WEAK and no libpthread.so dependency can be found in the ELF file? Does the dynamic linker take the pthread symbols from libc.so (stubs) instead of libpthread.so?
There's a lot happening here: differences between gcc and clang, differences between GNU ld and gold, the --as-needed linker flag, two different failure modes, and maybe even some timing issues.
Let's start with how to link a program using POSIX threads.
The compiler's -pthread flag is all you should need. It's a compiler flag, so you should use it both when compiling code that uses threads and when linking the final executable. When you use -pthread on the link step, the compiler will provide the -lpthread flag automatically, and in the right place in the link line.
Typically, you would only use it when linking the final executable, and not when linking a shared library. If you simply want to make your library thread-safe, but don't want to force every program that uses your library to link with pthreads, you'd want to use a runtime check to see if the pthreads library is loaded, and call the pthread APIs only if it is. On Linux, this is typically done by checking a "canary": for example, make a weak reference to an arbitrary symbol like __pthread_key_create, which will only be defined if the library is loaded, and will have the value 0 if the program was linked without it.
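The canary check described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the helper name PthreadLibraryLoaded is hypothetical, the weak declaration relies on a GCC/clang attribute, and the signature is assumed to mirror pthread_key_create. On newer glibc releases, where libpthread has been merged into libc, I would expect the check to always report true.

```cpp
#include <cassert>
#include <pthread.h>

// Weak reference to the canary symbol (GCC/clang extension). If no loaded
// library defines it, the dynamic linker resolves its address to null.
// Assumption: the signature mirrors pthread_key_create.
extern "C" int __pthread_key_create(pthread_key_t*, void (*)(void*))
    __attribute__((weak));

// Hypothetical helper: reports whether the canary symbol was resolved,
// which is taken as a sign that the pthread implementation is loaded.
bool PthreadLibraryLoaded() {
    return &__pthread_key_create != nullptr;
}
```

A library would call PthreadLibraryLoaded() before taking locks and skip the pthread calls entirely when it returns false. Note that the declaration creates only a weak reference, which is exactly the kind of reference that --as-needed ignores.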
In your case, however, your library libodr.so pretty much depends on threads, so it's reasonable to link it with the -pthread flag.
That brings us to the first failure mode: if you use g++ and gold for both link steps, the program throws std::system_error and says you need to enable multithreading. This is due to the --as-needed flag. GCC passes --as-needed to the linker by default, while clang (apparently) does not. With --as-needed, the linker will only record library dependencies that resolve a strong reference. Since all the references to pthread APIs are weak, none of them are sufficient to tell the linker that libpthread.so should be added to the dependency list (via a DT_NEEDED entry in the dynamic table). Changing to clang or adding a -Wl,--no-as-needed flag solves this problem, and the program will load the pthread library.
But, wait, why don't you need to do this when using the GNU linker? It uses the same rule: only a strong reference causes the library to be recorded as a dependency. The difference is that GNU ld also considers references from other shared libraries, while gold only considers references from regular object files. It turns out that the pthread library provides overriding definitions of several libc symbols, and there are strong references from libstdc++.so to some of those symbols (e.g., write). Those strong references are enough to get GNU ld to record libpthread.so as a dependency. This is more of an accident than design; I don't think changing gold to consider references from other shared libraries would actually be a robust fix. I think the proper solution is for GCC to put --no-as-needed in front of the -lpthread flag when you use -pthread.
This raises the question of why this issue doesn't come up all the time when using POSIX threads and the gold linker. But this is a small test program; a larger program is almost certain to contain strong references to some of those libc symbols that libpthread.so overrides.
Now let's look at the second failure mode, where both Notify() and Get() block indefinitely if you link libodr.so with g++, gold and -lpthread.
In Notify(), you're holding the lock through the end of the function, including while you call cv.notify_one(). You really only need to hold the lock to set the ready flag; if we change the function to release the lock before calling notify_one(), then the thread calling Get() times out after 300 ms and does not block. So it's really the call to notify_one() that's blocking, and the program deadlocks because Get() is waiting on that same lock.
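For reference, here is roughly what I mean by releasing the lock before notifying. It's a minimal sketch rather than a fix for the underlying linking problem; the function name NotifyAfterUnlock is my own, and the globals simply mirror the ones in odr.cpp.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool ready = false;

// Variant of Notify(): the lock covers only the write to `ready`.
void NotifyAfterUnlock() {
    {
        std::lock_guard<std::mutex> lock(mtx);
        ready = true;
    }  // mtx is released here, before the notification
    cv.notify_one();
}
```

With this variant the mutex is no longer held if notify_one() blocks, so a waiter in Get() can still reacquire it when its timeout expires.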
So why does it block only when __pthread_key_create is FUNC instead of NOTYPE? I think the type of the symbol is a red herring, and that the real problem is caused by the fact that gold doesn't record the symbol versions for references resolved by a library that isn't added as a needed library. The implementation of wait_for calls pthread_cond_timedwait, which has two versions in both libpthread and libc. It's possible that the loader is binding the reference to the wrong version, causing a deadlock by failing to unlock the mutex. I made a temporary patch to gold to record those versions, and that made the program work. Unfortunately, that's not a solution, as that patch can cause ld.so to crash under other circumstances.
I tried changing cv.wait_for(...) to cv.wait(lock, []{ return ready; }), and the program runs perfectly in all scenarios, which further suggests that the problem is with pthread_cond_timedwait.
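The experiment above amounts to something like the following sketch. The names NotifyReady and GetBlocking are mine, chosen to keep them apart from the original Notify() and Get(); everything else mirrors odr.cpp. Keep in mind that this gives up the 300 ms timeout: GetBlocking() waits until ready becomes true, however long that takes.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool ready = false;

// Same logic as the original Notify(), minus the no-op duration statement.
void NotifyReady() {
    std::unique_lock<std::mutex> lock(mtx);
    ready = true;
    cv.notify_one();
}

// Untimed wait with a predicate: returns only once `ready` is true, and
// returns immediately if it is already true when called.
void GetBlocking() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [] { return ready; });
}
```

The predicate also guards against spurious wakeups and against the notification arriving before the waiter starts waiting, neither of which the original wait_for call without a predicate handles.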
The bottom line is that adding the --no-as-needed flag will fix the problem for this very small test case. Anything larger is likely to work without the extra flag, as you'll be increasing the odds of making a strong reference to a symbol in libpthread. (For example, adding a call to std::this_thread::sleep_for anywhere in odr.cpp adds a strong reference to nanosleep, which puts libpthread in the needed list.)
Update: I've verified that the failing program is linking to the wrong version of pthread_cond_timedwait. In glibc 2.3.2, the pthread_cond_t type was changed, and the old versions of the APIs that use the type were changed to dynamically allocate a new (bigger) structure and store a pointer to it in the original type. So now, if the consuming thread reaches cv.wait_for before the producing thread reaches cv.notify_one, the implementation of cv.wait_for calls the old version of pthread_cond_timedwait, which initializes what it thinks is an old pthread_cond_t in cv with a pointer to a new pthread_cond_t. After that, when the other thread reaches cv.notify_one, its implementation assumes that cv contains a new-style pthread_cond_t rather than a pointer to one, so it calls pthread_mutex_lock with the pointer to the new pthread_cond_t instead of the pointer to the mutex. It locks that would-be mutex, but the lock is never released, because the other thread unlocks the real mutex.