I get occasional hangs running a particular development version of a program that doesn't seem to hang using the official version. The dev version differs mainly in that it brings in a lot more Go std libraries, which (for the most part) it does not exercise; so the executable is larger, plus static-var and init() initializations are done, which might increase the likelihood of hitting some race condition.
git bisect run
identified the (golang) culprit as 6becb033341602f2df9d7c55cc23e64b925bbee2
:
Author: Ian Lance Taylor <iant@golang.org>
Date: Thu Apr 11 16:53:11 2019 -0700
[...]
runtime: switch to using new timer code
diff --git a/src/runtime/time.go b/src/runtime/time.go
index fea5d6871c..db48a932d4 100644
--- a/src/runtime/time.go
+++ b/src/runtime/time.go
@@ -14,7 +14,7 @@ import (
)
// Temporary scaffolding while the new timer code is added.
-const oldTimers = true
+const oldTimers = false
// Package time knows the layout of this structure.
// If this struct changes, adjust ../time/sleep.go:/runtimeTimer.
Having glanced at the differences this small change introduces, I'm leaning strongly towards there being some race condition in this "new timer code" and/or code it enables.
Whether via Ctrl-\ (SIGQUIT) or delve attach
, the culprit always seems to be the cmd.Start()
call here:
func sh(dir string, stdin io.Reader, stdout io.Writer, stderr io.Writer, name string, args []string) Object {
cmd := exec.Command(name, args...)
cmd.Dir = dir
cmd.Stdin = stdin
var stdoutBuffer, stderrBuffer bytes.Buffer
if stdout != nil {
cmd.Stdout = stdout
} else {
cmd.Stdout = &stdoutBuffer
}
if stderr != nil {
cmd.Stderr = stderr
} else {
cmd.Stderr = &stderrBuffer
}
err := cmd.Start()
PanicOnErr(err)
The stack traces from there look quite similar, until syscall/exec_unix.go (in the Go source tree) is reached. Then, in Delve, it's the forkAndExecInChild() call that seems to hang, while Ctrl-\ shows the readlen() call as hung:
// Kick off child.
pid, err1 = forkAndExecInChild(argv0p, argvp, envvp, chroot, dir, attr, sys, p[1])
if err1 != 0 {
err = Errno(err1)
goto error
}
ForkLock.Unlock()
// Read child error status from pipe.
Close(p[1])
n, err = readlen(p[0], (*byte)(unsafe.Pointer(&err1)), int(unsafe.Sizeof(err1)))
Close(p[0])
if err != nil || n != 0 {
The forkAndExecInChild()
code seems to be hung at exec_darwin.go:206
, which is a syscall to libc_dup2_trampoline
within a loop. Assuming that's just a call to dup2()
, I can't think of any reason it'd hang; but I've "caught" a hung test run there (and nowhere else) at least twice, via delve
, though perhaps that's just an artifact of using delve attach <pid> ...
versus Ctrl-\ (SIGQUIT)?
I've debugged and fixed various issues around such activities over the years (ok, decades), but am relatively new to the Go ecosystem, and don't want to submit a bug report before I have some understanding of what's going on.
In particular, Cmd.Start() is documented thusly:
Start starts the specified command but does not wait for it to complete.
So, on the face of it, it seems strange, if not outright buggy, that these hangs seem to point to that very call as the culprit. I.e. if it doesn't wait, why would it hang? Perhaps what looks like a direct OS call actually checks with the Go threads mechanism before or after the underlying OS call, and is getting hung there.
The problem shows up when running the test suite, which normally takes about 12 seconds to run. I've run this in loops lasting roughly 5 hours to do the git bisect run
; though it often triggers within 15 minutes, I've seen it take over 3 hours to do so.
If anyone wants to delve (ha!) more deeply into this and maybe try to reproduce it, the program I'm working on is "Joker", and here's the development version (my fork):
https://github.com/jcburley/joker/ (See branch gostd
; build via ./run.sh
.)
The problem occurs (occasionally), on OS X, when running ./all-tests.sh
. So far, the hangs have happened only when that script runs ./flag-tests.sh
or ./linter-tests.sh
, not yet ./eval-tests.sh
(which seems strange as well, since that always gets run first due to the alphabetical ordering).
The same test suite ran in a loop on my Ubuntu Linux (Ryzen 3) dev box, without hanging, for over 24 hours. Windows 7 looping has been going on for several hours as well, with no hang so far.
REPRO UPDATES:
6a569f243e028f823a9f20bfd9da7bdfab8699a4
git bisect run
identified the (golang) culprit as 6becb033341602f2df9d7c55cc23e64b925bbee2
; double-checked that and previous commit (by running five instances of the latter for many hours), looks like a solid resultThe Joker executable is much bigger for the dev version compared to the official (master/released) version; though most of that additional code is unexercised by this little test suite, it's possible some init() or static-var-init code, being run due to additional Go std libraries (packages) being pulled in, could contribute moreso (if not entirely) than sheer size and size-related issues by dint of launching additional go and/or OS threads, increasing contention, etc.
This was a bug on Go for MacOS and got fixed in https://go-review.googlesource.com/c/go/+/372798/
For affected Go versions prior to the fix, the workaround is to pass -Wl,-bind_at_load
to the linker, which can be done by invoking go with -ldflags="-extldflags=-Wl,-bind_at_load"