There is a native Erlang node that has been running flawlessly in outbound mode with a FreeSWITCH v1.10.1 instance and its mod_erlang_event C-node for years. After updating FreeSWITCH to v1.10.12 and Erlang to release 25, however, mod_erlang_event's C-node exits when the time comes to answer the call, and the few changes to mod_erlang_event over the years don't explain this behaviour.
I was able to narrow the issue down to mod_erlang_event:971-988 after resorting to "caveman debugging" with print statements:
    static void listener_main_loop(listener_t *listener)
    {
        int status = 1;
        while ((status >= 0 || erl_errno == ETIMEDOUT || erl_errno == EAGAIN) && !prefs.done) {
            /* .. */
            status = ei_xreceive_msg_tmo(listener->sockdes, &msg, &buf, 1);
            switch_log_printf(/* .. */, "listener->sockdes: %d\n", listener->sockdes);
            switch_log_printf(/* .. */,
                "ei_xreceive_msg_tmo returned status=%d, erl_errno=%d, errno=%d\n",
                status, erl_errno, errno);
            /* .. */
        }
    }
Until something noteworthy happens, the loop above endlessly logs
listener->sockdes: 71
ei_xreceive_msg_tmo returned status=-1, erl_errno=110, errno=14
According to error.h, 110 is ETIMEDOUT (connection timed out) and 14 is EFAULT (bad address), which seems to be normal behaviour according to the ei_connect docs. When things go awry, however, it outputs
ei_xreceive_msg_tmo returned status=1, erl_errno=0, errno=22
where I think errno=22 is EINVAL (invalid argument). Does that indicate that ei_xreceive_msg_tmo() received wrong input from the surrounding code (i.e., FreeSWITCH)?
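A note on reading those log lines, offered as a sketch rather than a diagnosis: in ei.h the receive functions report ERL_MSG (1) for a delivered message, ERL_TICK (0) for a keep-alive, and ERL_ERROR (-1) for failure, so status=1 with erl_errno=0 looks like a successful receive, and the errno=22 next to it may simply be a leftover from some earlier call. The helper below is hypothetical (classify_receive, recv_kind and the SKETCH_ constants are not part of ei or FreeSWITCH); it only shows how the status and erl_errno pairs from the log could be told apart.

```c
#include <errno.h>

/* Local mirrors of the ei.h result codes (assumed values: ERL_ERROR = -1,
 * ERL_TICK = 0, ERL_MSG = 1). Real code would include ei.h instead. */
enum { SKETCH_ERL_ERROR = -1, SKETCH_ERL_TICK = 0, SKETCH_ERL_MSG = 1 };

/* Hypothetical classification of one receive attempt. */
typedef enum {
    RECV_MESSAGE,   /* a message was delivered                   */
    RECV_TICK,      /* keep-alive tick, nothing to process       */
    RECV_TIMEOUT,   /* the _tmo call timed out                   */
    RECV_RETRY,     /* temporary error, try again                */
    RECV_FAILURE    /* failed without a timeout or retry code    */
} recv_kind;

/* Look at the return status first; consult erl_errno only on failure.
 * The plain errno is deliberately ignored, because it is not reset by
 * successful calls and can hold a stale value. */
recv_kind classify_receive(int status, int erl_errno_value)
{
    if (status == SKETCH_ERL_MSG)  return RECV_MESSAGE;
    if (status == SKETCH_ERL_TICK) return RECV_TICK;
    if (erl_errno_value == ETIMEDOUT) return RECV_TIMEOUT;
    if (erl_errno_value == EAGAIN)    return RECV_RETRY;
    return RECV_FAILURE;
}
```

With the values from the log, (status=-1, erl_errno=110) classifies as a timeout on Linux, and (status=1, erl_errno=0) as a received message.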
Update: As pointed out in the comments, the ei_xreceive_msg_tmo() timeout possibly needed to be raised, but that wasn't the root cause, and now I get:
listener exit: status=-1, erl_errno=0 errno=14
With that said, @JohnBollinger answered the question, and if anyone is interested in the issue itself, it is tracked here: signalwire/freeswitch issue #2808.
My takeaway is that if erl_errno is 0, the functions used from the Erlang C libraries (e.g., ei) probably did not fail, and the issue is possibly in one's own code (or in the codebase one is trying to integrate with Erlang).
Also, to recap:
ei_xreceive_msg_tmo returned status=1, erl_errno=0, errno=22
This was possibly fixed by raising the timeout from 1 ms.
listener exit: status=-1, erl_errno=0 errno=14
mod_erlang_event's main loop assumed that erl_errno would always hold the latest failure code (e.g., EAGAIN or ETIMEDOUT), which relied on a bug in ei. That bug was fixed in Erlang 22, so the loop exited once erl_errno was reset to 0 while status was -1. See issue #2808 for the specifics.
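To make the recap concrete, here is a minimal sketch that replays the while condition quoted from listener_main_loop against the logged values. The function name loop_continues is hypothetical, and erl_errno and prefs.done are passed in as plain parameters so the snippet compiles without ei.h or FreeSWITCH headers.

```c
#include <errno.h>
#include <stdbool.h>

/* Same boolean expression as the quoted while condition:
 * (status >= 0 || erl_errno == ETIMEDOUT || erl_errno == EAGAIN) && !prefs.done */
bool loop_continues(int status, int erl_errno_value, bool done)
{
    return (status >= 0
            || erl_errno_value == ETIMEDOUT
            || erl_errno_value == EAGAIN)
           && !done;
}
```

Fed the three logged combinations, the condition stays true for (status=-1, erl_errno=ETIMEDOUT) and for (status=1, erl_errno=0), but turns false for (status=-1, erl_errno=0), which matches the "listener exit" line.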
Thank you, @ticktalk and @JohnBollinger! It was indeed the timeout, and changing it to 100 solved the issue (and I was ready to rule it out...).
John, I truly appreciate your comments and answer for broadening my perspective and your taking the time & digging into the FreeSWITCH changelogs. Can't tell you how many times I perused that page and the GitHub ones as well... This taught me a valuable lesson.
If an Erlang C-node exits while erl_errno is 0, could that be construed that the issue is probably not Erlang related?
That would not be a safe assumption in general.
If a C node exits from inside a call to an Erlang-API function, then erl_errno may nevertheless be zero at that point, yet the issue is definitely Erlang related, even if the root is on the C side. And if a C node exits outside any call to an Erlang-API function then that might still be because an Erlang function did something that the C node did not expect, and therefore did not handle properly, whether the API call was successful or not. That's still Erlang related.
It is not clear from the code posted that you are observing such circumstances in your case. You provide evidence that erl_errno is zero at a particular point in the node's execution, but that doesn't prove that it is still zero when the node exits.
However, even if we assume that there are no other calls to the Erlang API between the one shown and the C node's termination, such that the failure occurs outside the Erlang API with erl_errno still zero, there is still the possibility that the C node is tripping over something that ei_xreceive_msg_tmo() did when it ran. We can't speak to what exactly that might be, because you have not presented any relevant code, but to the question actually posed: no.
Inasmuch as the code presented seems focused on making calls into the Erlang API and (presumably) processing the results obtained, it seems overwhelmingly likely that any issue it has is Erlang-related in some way. Especially so if the same C node code works as expected in a different environment, which the question seems to imply.
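The point about erl_errno being observed at one moment but not necessarily at exit can be illustrated with a small, self-contained sketch. Everything in it is a stand-in: sketch_erl_errno plays the role of erl_errno, and the two sketch_ functions are hypothetical substitutes for API calls, one that fails and one that succeeds and clears the indicator. The idea is to copy the error indicator immediately after each call, so that a later call cannot overwrite the value that belongs to the failure.

```c
#include <errno.h>

/* Stand-in for erl_errno (hypothetical; the real one comes from ei.h). */
int sketch_erl_errno = 0;

/* What one call reported, captured right after it returned. */
typedef struct {
    int status;  /* return value of the call               */
    int err;     /* error indicator as it was at that time */
} call_record;

/* Hypothetical API stand-in: fails and sets the indicator. */
int sketch_receive_fails(void)
{
    sketch_erl_errno = EIO;
    return -1;
}

/* Hypothetical API stand-in: succeeds and resets the indicator to 0. */
int sketch_other_call_ok(void)
{
    sketch_erl_errno = 0;
    return 0;
}

/* Run a call and snapshot the indicator before anything else can touch it. */
call_record record_call(int (*api_call)(void))
{
    call_record r;
    r.status = api_call();
    r.err = sketch_erl_errno;
    return r;
}
```

If record_call(sketch_receive_fails) is followed by record_call(sketch_other_call_ok), the live indicator reads 0 afterwards, while the first record still holds EIO. A log line written at exit from the live value alone would therefore hide the earlier failure.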