I have a pretty basic benchmark comparing the performance of a mutex vs. an atomic counter:
package counter_test

import (
	"sync"
	"sync/atomic"
	"testing"
)

const (
	numCalls = 1000
)

var (
	wg sync.WaitGroup
)

func BenchmarkCounter(b *testing.B) {
	var counterLock sync.Mutex
	var counter int
	var atomicCounter atomic.Int64

	b.Run("mutex", func(b *testing.B) {
		wg.Add(b.N)
		for i := 0; i < b.N; i++ {
			go func(wg *sync.WaitGroup) {
				for i := 0; i < numCalls; i++ {
					counterLock.Lock()
					counter++
					counterLock.Unlock()
				}
				wg.Done()
			}(&wg)
		}
		wg.Wait()
	})

	b.Run("atomic", func(b *testing.B) {
		wg.Add(b.N)
		for i := 0; i < b.N; i++ {
			go func(wg *sync.WaitGroup) {
				for i := 0; i < numCalls; i++ {
					atomicCounter.Add(1)
				}
				wg.Done()
			}(&wg)
		}
		wg.Wait()
	})
}
Typical output of go test -bench=. -benchmem looks as follows:
BenchmarkCounter/mutex-8 7680 188508 ns/op 618 B/op 3 allocs/op
BenchmarkCounter/atomic-8 38649 31006 ns/op 40 B/op 2 allocs/op
Running escape analysis with go test -gcflags '-m' shows that one allocation in each benchmark iteration (op) belongs to launching the goroutine:
./counter_test.go:57:17: func literal escapes to heap
./counter_test.go:60:7: func literal escapes to heap
./counter_test.go:72:18: func literal escapes to heap
./counter_test.go:75:7: func literal escapes to heap
(lines 57 and 72 are the b.Run() calls, and lines 60 and 75 are the go func() calls, so there is exactly one such call within each of the b.N iterations)
The same analysis shows that the variables declared at the beginning of the benchmark function are also moved to the heap:
./counter_test.go:21:6: moved to heap: counterLock
./counter_test.go:22:6: moved to heap: counter
./counter_test.go:23:6: moved to heap: atomicCounter
I'm just fine with that. What really bothers me is that I expect allocs/op to measure memory allocations per iteration (b.N iterations in total). So, for example, the single allocation of, say, counterLock divided by b.N iterations (7680 in the benchmark output above) should contribute 1/7680 = 0 (the division result being rounded to the nearest integer). The same should apply to counter and atomicCounter.
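If that were how it works, a one-time allocation should simply vanish in the per-op average. A minimal sketch of that expectation, in the same test file (BenchmarkOneTimeEscape and sink are hypothetical names, not part of the original benchmark):

var sink *sync.Mutex // package-level sink, so the mutex must escape to the heap

func BenchmarkOneTimeEscape(b *testing.B) {
	mu := &sync.Mutex{} // one heap allocation per benchmark run, not per iteration
	sink = mu           // leaking the pointer prevents stack allocation
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

With a large b.N, that single allocation divided by b.N rounds down to 0 allocs/op.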
However, this is not the case: I get 3 allocations instead of just 1 for the "mutex" benchmark (1 goroutine + counterLock + counter) and 2 for the "atomic" benchmark (1 goroutine + atomicCounter). It thus seems that the benchmarking logic treats the function-scope variables (counterLock, counter, atomicCounter) as being allocated anew during each of the b.N iterations, not just once at the beginning of BenchmarkCounter(). Is this logic correct? Am I missing something here?
EDIT: Investigating the memory profile with pprof shows allocations for the go func() only.
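For reference, such a profile can be collected and inspected roughly as follows (the exact invocation is an assumption on my part, and mem.out is just a placeholder file name):

go test -bench=. -benchmem -memprofile=mem.out
go tool pprof -alloc_objects mem.out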
Starting a goroutine allocates memory for its stack, so your go func accounts for two allocs per loop: one for the goroutine stack and one for the argument evaluation. I would have to check exactly what the memory layout is, but remember that this is a “go expression”, so your function value and parameters have to be evaluated first. One allocation goes away when you use the (global) wait group directly in a closure, i.e. func() {...}() instead of func(wg *sync.WaitGroup) {...}(&wg).
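A sketch of what that variant could look like for the "atomic" sub-benchmark, reusing the declarations from the question's BenchmarkCounter (only the goroutine launch changes):

b.Run("atomic", func(b *testing.B) {
	wg.Add(b.N)
	for i := 0; i < b.N; i++ {
		go func() { // plain closure: no argument list to evaluate at the go statement
			for i := 0; i < numCalls; i++ {
				atomicCounter.Add(1)
			}
			wg.Done()
		}()
	}
	wg.Wait()
})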
When you have 1,000 goroutines fighting for a mutex, you'll run into lockSlow/unlockSlow, which is also not allocation-free. You can easily test that with:
b.Run("mutex", func(b *testing.B) {
for i := 0; i < b.N; i++ {
wg.Add(1)
go func(wg *sync.WaitGroup) {
for i := 0; i < numCalls; i++ {
counterLock.Lock()
counter++
counterLock.Unlock()
}
wg.Done()
}(&wg)
wg.Wait()
}
})
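With wg.Wait() inside the loop, the goroutines run one at a time, so the mutex should stay on its uncontended fast path; if an allocation disappears compared to the original version, it came from the contended lockSlow/unlockSlow path.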
That would explain the three and two allocations per loop you are seeing.