c, sqlite, getaddrinfo, nslookup, uclinux

getaddrinfo hangs forever when linked with sqlite3


I have a program that needs both a DNS query and a sqlite3 DB connection. I have determined that it hangs indefinitely in a getaddrinfo() call. To isolate the problem I created a test program (based on busybox's nslookup.c) containing only this call. When I do not link against libsqlite3 it works as expected. The code is as follows:

#include <arpa/inet.h>
#include <netdb.h>
#include <resolv.h>
#include <stdio.h>   /* printf, puts */
#include <string.h>
#include <signal.h>

static int sockaddr_to_dotted(struct sockaddr *saddr, char *buf, int buflen)
{
    if (buflen <= 0) return -1;
    buf[0] = '\0';
    if (saddr->sa_family == AF_INET)
    {
        inet_ntop(AF_INET, &((struct sockaddr_in*)saddr)->sin_addr, buf, buflen);
        return 0;
    }
    if (saddr->sa_family == AF_INET6)
    {
        inet_ntop(AF_INET6, &((struct sockaddr_in6*)saddr)->sin6_addr, buf, buflen);
        return 0;
    }
    return -1;
}
static int print_host(const char *hostname, const char *header)
{
    char str[128]; /* IPv6 address will fit, hostnames hopefully too */
    struct addrinfo *result = NULL;
    int rc;
    struct addrinfo hint;

    memset(&hint, 0, sizeof(hint));
    /* hint.ai_family = AF_UNSPEC; - zero anyway */
    /* Needed. Or else we will get each address thrice (or more)
     * for each possible socket type (tcp,udp,raw...): */
    hint.ai_socktype = SOCK_STREAM;
    // hint.ai_flags = AI_CANONNAME;
    printf("BEFORE GETADDRINFO\n");
    rc = getaddrinfo(hostname, NULL /*service*/, &hint, &result);
    printf("AFTER GETADDRINFO\n");
    if (!rc)
    {
        struct addrinfo *cur = result;
        // printf("%s\n", cur->ai_canonname); ?
        while (cur)
        {
            sockaddr_to_dotted(cur->ai_addr, str, sizeof(str));
            printf("%s  %s\nAddress: %s\n", header, hostname, str);
            str[0] = ' ';
            if (getnameinfo(cur->ai_addr, cur->ai_addrlen, str + 1,
                            sizeof(str) - 1, NULL, 0, NI_NAMEREQD))
                str[0] = '\0';
            puts(str);
            cur = cur->ai_next;
        }
    }
    else
    {
        printf("getaddrinfo('%s') failed: %s\n", hostname, gai_strerror(rc));
    }
    if (result)   /* result stays NULL when getaddrinfo() fails */
        freeaddrinfo(result);
    return (rc != 0);
}

int main(int argc, char **argv)
{
    if (argc != 2)
        return -1;

    res_init();
    return print_host(argv[1], "Name: ");
}

I only see "BEFORE GETADDRINFO" in the output. I also ran the program under strace (my DNS server is 192.168.11.11, and I queried "www.google.com"). This is where it hangs:

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.11.11")}, 16) = 0
send(3, "\0\2\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, 0) = 32
pselect6(4, [3], NULL, NULL, {10, 0}, 0) = 1 (in [3], left {9, 988000000})
recv(3, "\0\2\201\200\0\1\0\5\0\0\0\0\3www\6google\3com\0\0\1\0"..., 512, 0) = 112
close(3)                                = 0
rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
rt_sigsuspend([]

My compiler is bfin-linux-uclibc-gcc (gcc version 4.1.2). I cross-compiled sqlite3 (version 3.6.23) for bfin-linux-uclibc.

I would appreciate any comments, help, or suggestions for a debugging procedure.

Output of strace -e trace=file mybinary:

stat("/etc/ld.so.cache", {st_mode=S_IFREG|0644, st_size=1073, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/libsqlite3.so.0", O_RDONLY)  = 3
open("/lib/libstdc++.so.6", O_RDONLY)   = 3
open("/lib/libm.so.0", O_RDONLY)        = 3
open("/lib/libgcc_s.so.1", O_RDONLY)    = 3
open("/lib/libc.so.0", O_RDONLY)        = 3
open("/lib/libdl.so.0", O_RDONLY)       = 3
open("/lib/libpthread.so.0", O_RDONLY)  = 3
open("/lib/libgcc_s.so.1", O_RDONLY)    = 3
open("/lib/libc.so.0", O_RDONLY)        = 3
open("/lib/libm.so.0", O_RDONLY)        = 3
open("/lib/libgcc_s.so.1", O_RDONLY)    = 3
open("/lib/libc.so.0", O_RDONLY)        = 3
open("/lib/libc.so.0", O_RDONLY)        = 3
open("/lib/libc.so.0", O_RDONLY)        = 3
open("/lib/libc.so.0", O_RDONLY)        = 3
open("/lib/libc.so.0", O_RDONLY)        = 3
stat("/lib/ld-uClibc.so.0", {st_mode=S_IFREG|0755, st_size=29824, ...}) = 0
open("/etc/resolv.conf", O_RDONLY)      = 3
open("/etc/hosts", O_RDONLY)            = 3

Output of bfin-linux-uclibc-nm -g mybinary:

00004fc4 A ___bss_start  
         w ___deregister_frame_info@@GCC_3.0  
00004f10 D ___dso_handle  
00004fc4 A __edata  
00004fe0 A __end  
00000d60 T __fini  
         U _freeaddrinfo  
         U _gai_strerror  
         U _getaddrinfo  
         U _getnameinfo  
         U _inet_ntop  
00000534 T __init  
         w __Jv_RegisterClasses  
00000aa4 T _main  
         U _printf  
         U _puts  
         w ___register_frame_info@@GCC_3.0  
         U ___res_init  
00000e18 R __ROFIXUP_END__  
00000de0 R __ROFIXUP_LIST__  
00000670 T ___self_reloc  
00020000 A __stacksize  
0000060c T __start  
         U ___uClibc_main  

Solution

  • The updated information shows libpthread being loaded, so the likely scenario is that SQLite was built with pthread support enabled (the default on most platforms) and your binary was not.

    The clue is the presence of libpthread together with the hang at rt_sigsuspend(). That call is an explicit wait for a signal, and here it is very likely one thread waiting for another thread to exit, which of course never happens.
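
    To make the rt_sigsuspend() line in the strace output more concrete, here is a minimal sketch of what a sigsuspend() wait looks like from C. The function name wait_for_pending_usr1() and the choice of SIGUSR1 are illustrative assumptions, not anything taken from uClibc or SQLite. The signal is raised while blocked, so it is already pending when sigsuspend() is entered and the call returns at once; if nothing ever sent the signal, the call would sit there forever, which is the symptom in the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_usr1 = 0;

/* Signal handler: only records that the signal arrived. */
static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Returns 1 if sigsuspend() was woken by SIGUSR1, 0 on any setup failure. */
int wait_for_pending_usr1(void)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return 0;

    /* Block SIGUSR1 so that raising it leaves it pending. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return 0;

    got_usr1 = 0;
    raise(SIGUSR1);                 /* pending, not yet delivered */

    /* Atomically unblock SIGUSR1 and sleep until a signal is handled.
     * Without the raise() above this call would never return. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the original mask */
    return got_usr1 == 1;
}
```

    Calling wait_for_pending_usr1() from main() and printing the result is enough to try it; under strace the wait shows up as rt_sigsuspend on Linux.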

    The background to this is that since C and the standard library/libc pre-date contemporary threading, there are many cases where the standard library or API is either not re-entrant or not thread-safe, or both. Back when dragons roamed the land it was common for the programmer to have to explicitly call alternate versions of such functions (names suffixed with "_r") or use alternate libraries (again usually with an "_r" suffix) to ensure that code behaved correctly. pthreads changed the programming interface for the better, but since thread-safety comes at a cost (performance, sometimes substantial, and code size) it's not enabled unless you ask for it.
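
    As an illustration of the "_r" convention, here is a small sketch using strtok_r(), the re-entrant counterpart of strtok(). The helper name count_tokens() is made up for this example. The point is the extra save pointer: the parsing state lives in the caller instead of in a hidden static buffer inside libc, so two threads (or two nested loops) cannot trample each other.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Count the tokens in `text` separated by any character in `delims`.
 * Returns -1 if memory allocation fails. */
int count_tokens(const char *text, const char *delims)
{
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);   /* strtok_r() modifies its input */
    char *save = NULL;          /* caller-owned state, unlike strtok() */
    char *tok;
    int n = 0;

    if (copy == NULL)
        return -1;
    memcpy(copy, text, len);

    for (tok = strtok_r(copy, delims, &save);
         tok != NULL;
         tok = strtok_r(NULL, delims, &save))
        n++;

    free(copy);
    return n;
}
```

    For example, count_tokens("www.google.com", ".") splits the name into its three labels.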

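    When you use -pthread at least two things usually happen: thread-related preprocessor macros are defined (typically _REENTRANT), and libpthread is added to the link.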

    It would take some non-trivial debugging to be certain, but what probably happened is that your binary ended up mixing the stub pthread functions in uClibc with a handful of the real pthread functions. Because libpthread was not loaded explicitly, only the pthread symbols referenced by libsqlite were imported. uClibc (like glibc) contains dummy pthread functions (run nm on libc.so to see them). These are defined as "weak" symbols; when the real libpthread is loaded explicitly, it takes over all entry points with its "strong" symbols. (These stubs exist so that thread-aware libraries can work with non-threaded programs without changes.)

    Building your binary with an explicit -pthread eliminates this mismatch, and resolves the issue.
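
    The weak/strong idea can be sketched in a few lines. This is only an illustration of the mechanism at static link time with GCC, not uClibc's actual stub code; demo_mutex_lock() and demo_lock_result() are hypothetical names.

```c
/* Hypothetical stand-in for a libc stub: a weak, do-nothing definition.
 * __attribute__((weak)) is a GCC extension. */
__attribute__((weak)) int demo_mutex_lock(void)
{
    return 0;   /* stub: reports success without locking anything */
}

/* Calls whichever definition the linker picked: the weak stub above,
 * or a strong demo_mutex_lock() from another object file if one exists. */
int demo_lock_result(void)
{
    return demo_mutex_lock();
}
```

    Built on its own, demo_lock_result() uses the stub. If a second object file with an ordinary (strong) definition of demo_mutex_lock() is linked in, that definition wins without a duplicate-symbol error, which is the takeover described above.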


    For debugging:

    Run nm -g and ldd (the uClibc version) against your compiled binary, check which symbols are in which library, and see if you can spot a mismatch. Setting LD_DEBUG=all when running your program should be useful too (you'll probably want to redirect stderr for that, since there will be a lot of output).

    The SQLite library has a .init section, but as far as I can tell it's a stub that doesn't call any internal functions, so simply linking shouldn't cause SQLite code to execute.

    Since SQLite uses threads, make sure you built it thread-safe, and are using the .so dynamic library.

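    When you link against your build of SQLite, make sure you set both the -L (build-time) and -R (run-time) library paths. Usually something like this before compile & link will do the trick (amend the path as needed; if the flags are passed through gcc rather than directly to ld, the run-time path may need to be spelled -Wl,-rpath,/usr/local/sqlite3/lib):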

    export CFLAGS=-L/usr/local/sqlite3/lib
    export LDFLAGS=-R/usr/local/sqlite3/lib
    

    Test program:

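    #include <stdio.h>
    #include <sqlite3.h>
    
    int main(int argc, char *argv[]) {
        /* Version of the header this program was compiled against */
        printf("SQLite version (compile): %s\n", SQLITE_VERSION);
        /* Version of the library actually loaded at run time */
        printf("SQLite version (API): %s\n", sqlite3_libversion());
        return 0;
    }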
    

    If you run this and get different versions, then something is definitely wrong with your build environment.


    These guesses don't directly solve this problem, but I'll leave them here for the record:

    Normally my first guess would be an NSS run-time/compile-time library mismatch: since you're using the system getaddrinfo(), NSS (name service switch) is involved. It will dlopen() various libraries to support various user/group/host databases, depending on /etc/nsswitch.conf, including local files, DNS, LDAP, Berkeley DB and quite possibly SQLite. Since uClibc doesn't support this (glibc-style libnss_xxx.so), that's one thing ruled out...

    There's another possibility: PAM does something similar, and may load an incompatible library (Berkeley DB or possibly SQLite, as used by pam_userdb or pam-sqlite). Neither uClibc nor SQLite uses PAM though, and it's improbable that it's being linked by accident.

    Since dlopen() is used, you won't see such libraries (NSS or PAM) with ldd. Running under strace -e trace=file should help to confirm which libraries are being used, without the usual volume of output.
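
    If strace is not convenient on the target, a rough alternative is to have the process report its own mapped libraries. The sketch below reads /proc/self/maps, which is Linux-specific and assumed to be available on your system; print_mapped_libraries() is a name invented for this example. It also catches libraries pulled in with dlopen(), as long as it is called after they are loaded.

```c
#include <stdio.h>
#include <string.h>

/* Print each mapped file whose path contains ".so" to `out`.
 * Returns the number of paths printed, or -1 if /proc/self/maps
 * cannot be opened. Very long lines may be reported truncated. */
int print_mapped_libraries(FILE *out)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];
    char last[512] = "";
    int count = 0;

    if (maps == NULL)
        return -1;

    while (fgets(line, sizeof line, maps) != NULL) {
        char *path = strchr(line, '/');   /* pathname column, if present */

        if (path == NULL || strstr(path, ".so") == NULL)
            continue;
        path[strcspn(path, "\n")] = '\0';
        if (strcmp(path, last) == 0)      /* skip consecutive segments of one file */
            continue;
        strcpy(last, path);               /* fits: path is shorter than line[] */
        fprintf(out, "%s\n", path);
        count++;
    }

    fclose(maps);
    return count;
}
```

    Calling print_mapped_libraries(stderr) just before the getaddrinfo() call in the test program would show whether libpthread (or anything unexpected) is mapped at that point.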