c++clinuxalpine-linuxstack-corruption

Program segfaults on alpine linux. How do I resolve it?


I've been working on a webrtc datachannel library in C/C++ and wrote a program in C to:

  1. Create two peers from the same process.
  2. Establish a connection between them.
  3. Close the connection if it's successful.

Everything runs fine on a debian docker container and on my host opensuse tumbleweed (all x86_64 and 64bit), but on alpine linux container (64bit x86_64), I'm getting a SEGFAULT inside the child processes:

segfault in libnice

The function above is from the program's dependency "libnice". It seems like *agent == NULL and there is no way that is made null in the caller's scope. I even inserted a printf("Argument is %p", agent); right before the function call and it prints out its memory and I can verify it's not null. From the disassembly, it looks like the line where copying the agent's contents (0x557a1d20) as the local variable in the callee's stack results in a segfault. The segfault always occurs at this point even after a make clean and recompilation. Fail at activation record? Stack corruption?

UPDATE: I made a more lightweight container and ran it, and now it segfaults at a different place in that same priv_conn_keepalive_tick_unlocked. The argument seems to be set though (Notice the 0x7ffff7f9ad08): segfault2

Since I thought I might be hitting the libmusl's default stack limit of 80k, I used getrlimit(RLIMIT_STACK, &rl) to obtain the stack size and it looks like it's already 8 MB and not 80k. Increasing this limit further does not seem to make any difference except that if I assign more than 8 MB, my program crashes early inside the Gdb. Gdb says it got an unknown signal "? ?"; outside the gdb, it crashes at the normal point where it normally crashes without the altered stack size.

I'm not sure what exactly the problem is (stack corruption?) and what to do next to resolve this.

Here's my program's flow:

For every peer that is created, a child process is created with a fork(). Parent <--> child communication is done by ZeroMQ and I use protocol buffers to forward any callbacks (and its arguments) that are triggered inside the child onto an event loop running in the parent process.

So for the above program, there are 2 child processes and 1 parent process.

Steps to reproduce:


Solution

  • On further investigation, the crash is in an instruction writing at a mildly large negative offset from the stack base pointer, so it's probably just a simple stack overflow.

    The right way to fix this is reducing the excess stack usage or explicitly requesting a large stack at pthread_create time, but I don't see where pthread_create is being called from. A quick check to verify that this is the problem would be to override the default stack size for new threads by performing the following somewhere early in the program:

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 1<<20); // 1 MB
    pthread_setattr_default_np(&attr);