I have a program which requires a DNS query and a sqlite3 DB connection. I have determined that it hangs indefinitely at a getaddrinfo() call, so I created a test program (from busybox's nslookup.c) containing only this call. When I do not link against libsqlite3, it works as expected. The code is as follows:
#include <arpa/inet.h>
#include <netdb.h>
#include <resolv.h>
#include <stdio.h>   /* printf(), puts() */
#include <string.h>
#include <signal.h>

static int sockaddr_to_dotted(struct sockaddr *saddr, char *buf, int buflen)
{
    if (buflen <= 0) return -1;
    buf[0] = '\0';
    if (saddr->sa_family == AF_INET)
    {
        inet_ntop(AF_INET, &((struct sockaddr_in*)saddr)->sin_addr, buf, buflen);
        return 0;
    }
    if (saddr->sa_family == AF_INET6)
    {
        inet_ntop(AF_INET6, &((struct sockaddr_in6*)saddr)->sin6_addr, buf, buflen);
        return 0;
    }
    return -1;
}

static int print_host(const char *hostname, const char *header)
{
    char str[128]; /* IPv6 address will fit, hostnames hopefully too */
    struct addrinfo *result = NULL;
    int rc;
    struct addrinfo hint;

    memset(&hint, 0, sizeof(hint));
    /* hint.ai_family = AF_UNSPEC; - zero anyway */
    /* Needed. Or else we will get each address thrice (or more)
     * for each possible socket type (tcp,udp,raw...): */
    hint.ai_socktype = SOCK_STREAM;
    // hint.ai_flags = AI_CANONNAME;
    printf("BEFORE GETADDRINFO\n");
    rc = getaddrinfo(hostname, NULL /*service*/, &hint, &result);
    printf("AFTER GETADDRINFO\n");
    if (!rc)
    {
        struct addrinfo *cur = result;
        // printf("%s\n", cur->ai_canonname); ?
        while (cur)
        {
            sockaddr_to_dotted(cur->ai_addr, str, sizeof(str));
            printf("%s %s\nAddress: %s\n", header, hostname, str);
            /* Reverse lookup: print " <name>" or an empty line on failure */
            str[0] = ' ';
            if (getnameinfo(cur->ai_addr, cur->ai_addrlen, str + 1,
                            sizeof(str) - 1, NULL, 0, NI_NAMEREQD))
                str[0] = '\0';
            puts(str);
            cur = cur->ai_next;
        }
    }
    else
    {
        printf("getaddrinfo('%s') failed: %s\n", hostname, gai_strerror(rc));
    }
    freeaddrinfo(result);
    return (rc != 0);
}

int main(int argc, char **argv)
{
    if (argc != 2)
        return -1;
    res_init();
    return print_host(argv[1], "Name: ");
}
I only see "BEFORE GETADDRINFO" in the output. I also ran the program under strace (my DNS server is 192.168.11.11, and I queried "www.google.com"). This is where it stalls:
socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.11.11")}, 16) = 0
send(3, "\0\2\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, 0) = 32
pselect6(4, [3], NULL, NULL, {10, 0}, 0) = 1 (in [3], left {9, 988000000})
recv(3, "\0\2\201\200\0\1\0\5\0\0\0\0\3www\6google\3com\0\0\1\0"..., 512, 0) = 112
close(3) = 0
rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
rt_sigsuspend([]
My compiler is bfin-linux-uclibc-gcc (gcc version 4.1.2).
I cross-compiled sqlite3 (version 3.6.23) for bfin-linux-uclibc.
I appreciate any comments, help, or suggestions for a debugging procedure.
Output of strace -e trace=file mybinary:
stat("/etc/ld.so.cache", {st_mode=S_IFREG|0644, st_size=1073, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib/libsqlite3.so.0", O_RDONLY) = 3
open("/lib/libstdc++.so.6", O_RDONLY) = 3
open("/lib/libm.so.0", O_RDONLY) = 3
open("/lib/libgcc_s.so.1", O_RDONLY) = 3
open("/lib/libc.so.0", O_RDONLY) = 3
open("/lib/libdl.so.0", O_RDONLY) = 3
open("/lib/libpthread.so.0", O_RDONLY) = 3
open("/lib/libgcc_s.so.1", O_RDONLY) = 3
open("/lib/libc.so.0", O_RDONLY) = 3
open("/lib/libm.so.0", O_RDONLY) = 3
open("/lib/libgcc_s.so.1", O_RDONLY) = 3
open("/lib/libc.so.0", O_RDONLY) = 3
open("/lib/libc.so.0", O_RDONLY) = 3
open("/lib/libc.so.0", O_RDONLY) = 3
open("/lib/libc.so.0", O_RDONLY) = 3
open("/lib/libc.so.0", O_RDONLY) = 3
stat("/lib/ld-uClibc.so.0", {st_mode=S_IFREG|0755, st_size=29824, ...}) = 0
open("/etc/resolv.conf", O_RDONLY) = 3
open("/etc/hosts", O_RDONLY) = 3
Output of bfin-linux-uclibc-nm -g mybinary:
00004fc4 A ___bss_start
w ___deregister_frame_info@@GCC_3.0
00004f10 D ___dso_handle
00004fc4 A __edata
00004fe0 A __end
00000d60 T __fini
U _freeaddrinfo
U _gai_strerror
U _getaddrinfo
U _getnameinfo
U _inet_ntop
00000534 T __init
w __Jv_RegisterClasses
00000aa4 T _main
U _printf
U _puts
w ___register_frame_info@@GCC_3.0
U ___res_init
00000e18 R __ROFIXUP_END__
00000de0 R __ROFIXUP_LIST__
00000670 T ___self_reloc
00020000 A __stacksize
0000060c T __start
U ___uClibc_main
The updated information shows libpthread being loaded, so the likely scenario is that SQLite was built with pthread support enabled (the default on most platforms) and your binary was not.
The clue is the presence of libpthread together with the hang at rt_sigsuspend(): this is an explicit wait for a signal, and is very likely one thread waiting for another thread to exit, which of course never happens.
The background to this is that C and its standard library predate contemporary threading, so there are many cases where the standard library or API is not re-entrant, not thread-safe, or both. Back when dragons roamed the land, it was common for the programmer to have to explicitly call alternate versions of such functions (names suffixed with "_r") or use alternate libraries (again usually with an "_r" suffix) to ensure that code behaved correctly. pthreads changed the programming interface for the better, but since thread-safety comes at a cost (performance, sometimes substantial, and code size), it's not enabled unless you ask for it.
When you use -pthread, at least two things usually happen:
- _REENTRANT is defined as a preprocessor macro; this may change compile-time behaviour
- libpthread is linked in (equivalent to -lpthread); this will change run-time behaviour
It would take some non-trivial debugging to be certain, but what probably happened is that your binary ended up mixing the stub pthread functions in uClibc with a handful of the real pthread functions. This is because libpthread was not loaded explicitly; only the pthread symbols referenced by libsqlite were imported.
uClibc contains (as does glibc) dummy pthread functions (run nm on libc.so to see). These are defined as "weak" symbols; when the real libpthread is loaded explicitly, it takes over all entry points with its "strong" symbols. (These stubs exist so that thread-aware libraries can work with non-threaded programs without changes.)
Building your binary with an explicit -pthread eliminates this mismatch, and resolves the issue.
For debugging:
Run nm -g and ldd (the uClibc version) against your compiled binary, check which symbols are in which library, and see if you can spot a mismatch. Setting LD_DEBUG=all when running your program should be useful too (you'll probably want to redirect stderr for that, as there will be a lot of output).
The SQLite library has a .init section, but as far as I can tell it's a stub that doesn't call any internal functions, so simply linking shouldn't cause SQLite code to execute.
Since SQLite uses threads, make sure you built it thread-safe and are using the .so dynamic library.
When you link against your build of SQLite, make sure you use both -L (compile-time) and -R (run-time) library paths; usually something like this before compile and link will do the trick (amend the path as needed):
export CFLAGS=-L/usr/local/sqlite3/lib
export LDFLAGS=-R/usr/local/sqlite3/lib
Test program:
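#include <stdio.h>
#include <sqlite3.h>

int main(int argc, char *argv[])
{
    /* Version of the header the program was compiled against */
    printf("SQLite version (compile): %s\n", SQLITE_VERSION);
    /* Version of the library actually loaded at run time */
    printf("SQLite version (API): %s\n", sqlite3_libversion());
    return 0;
}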
If you run this and get different versions, then something is definitely wrong with your build environment.
These guesses don't directly solve this problem, but I'll leave them here for the record:
Normally my first guess would be an NSS run-time/compile-time library mismatch: as you're using the system getaddrinfo(), NSS (name service switch) is involved. This will dlopen() various libraries to support various user/group/host databases, depending on /etc/nsswitch.conf, including local files, DNS, LDAP, Berkeley DB and quite possibly SQLite. Since uClibc doesn't support this (glibc-style libnss_xxx.so), that's one thing ruled out...
There's another possibility: PAM does something similar, and may load an incompatible library (BerkeleyDB or possibly SQLite, as used by pam_userdb or pam-sqlite). Neither uClibc nor SQLite uses PAM though, and it's improbable that it's being linked by accident.
Since dlopen() is used, you won't see such libraries (NSS or PAM) with ldd; running under strace -e trace=file should help to confirm which libraries are being used, without the usual volume of output.