go, benchmarking, allocation

Cannot explain alloc/op in benchmark result


I have a pretty basic benchmark comparing the performance of a mutex vs. an atomic counter:

const (
    numCalls = 1000
)

var (
    wg sync.WaitGroup
)

func BenchmarkCounter(b *testing.B) {
    var counterLock sync.Mutex
    var counter int
    var atomicCounter atomic.Int64

    b.Run("mutex", func(b *testing.B) {
        wg.Add(b.N)
        for i := 0; i < b.N; i++ {
            go func(wg *sync.WaitGroup) {
                for i := 0; i < numCalls; i++ {
                    counterLock.Lock()
                    counter++
                    counterLock.Unlock()
                }
                wg.Done()
            }(&wg)
        }
        wg.Wait()
    })

    b.Run("atomic", func(b *testing.B) {

        wg.Add(b.N)
        for i := 0; i < b.N; i++ {
            go func(wg *sync.WaitGroup) {
                for i := 0; i < numCalls; i++ {
                    atomicCounter.Add(1)
                }
                wg.Done()
            }(&wg)
        }
        wg.Wait()
    })
}

Typical output of go test -bench=. -benchmem looks as follows:

BenchmarkCounter/mutex-8        7680        188508 ns/op         618 B/op          3 allocs/op
BenchmarkCounter/atomic-8      38649         31006 ns/op          40 B/op          2 allocs/op

Running escape analysis with go test -gcflags '-m' shows that one allocation in each benchmark iteration (op) belongs to the goroutine being started:

./counter_test.go:57:17: func literal escapes to heap
./counter_test.go:60:7: func literal escapes to heap
./counter_test.go:72:18: func literal escapes to heap
./counter_test.go:75:7: func literal escapes to heap

(lines 57 and 72 are the b.Run() calls, and lines 60 and 75 are the go func() calls, so there is exactly one such call within each of the b.N iterations)

The same analysis shows that the variables declared at the beginning of the benchmark function are also moved to the heap:

./counter_test.go:21:6: moved to heap: counterLock
./counter_test.go:22:6: moved to heap: counter
./counter_test.go:23:6: moved to heap: atomicCounter
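
This heap movement is expected whenever a local variable is shared with a goroutine. As a minimal, hypothetical illustration (not part of the test above), building the following with go build -gcflags='-m' reports both "moved to heap: n" and "func literal escapes to heap" (with line numbers specific to this file):

    package main

    func main() {
        // n is captured and modified by the goroutine below, so it cannot
        // live on main's stack; escape analysis moves it to the heap.
        n := 0
        done := make(chan struct{})

        // The func literal started by the go statement escapes to the heap as well.
        go func() {
            n++
            close(done)
        }()

        <-done
        println(n)
    }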

I'm fine with all of that. What really bothers me is that I expect alloc/op to measure memory allocations per iteration (b.N iterations in total). So, for example, the single allocation of, say, counterLock divided by b.N iterations (7680 in the benchmark output above) should contribute 1/7680 = 0 (rounding the division result to the nearest integer). The same should apply to counter and atomicCounter.

However, this is not the case: I get 3 allocations instead of just 1 for the "mutex" benchmark (1 goroutine + counterLock + counter), and 2 instead of 1 for the "atomic" one (1 goroutine + atomicCounter). It thus seems that the benchmarking logic treats the function-scope variables (counterLock, counter, atomicCounter) as being allocated anew on each of the b.N iterations, rather than just once at the beginning of BenchmarkCounter(). Is this logic correct? Am I missing something here?
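
One way to double-check how alloc/op is derived is to run a benchmark programmatically and look at the raw counters. Below is a minimal stand-alone sketch (not my actual test file, and simplified to a single increment per goroutine) using testing.Benchmark; its AllocsPerOp() is documented as MemAllocs / N, i.e. the total number of allocations observed during the run divided by b.N:

    package main

    import (
        "fmt"
        "sync"
        "testing"
    )

    func main() {
        var mu sync.Mutex
        counter := 0

        // Run a benchmark outside "go test" and inspect the raw counters.
        res := testing.Benchmark(func(b *testing.B) {
            var wg sync.WaitGroup
            wg.Add(b.N)
            for i := 0; i < b.N; i++ {
                go func() {
                    mu.Lock()
                    counter++
                    mu.Unlock()
                    wg.Done()
                }()
            }
            wg.Wait()
        })

        // AllocsPerOp() == MemAllocs / N: per-goroutine allocations dominate,
        // while one-off allocations amortize to zero for large N.
        fmt.Println("N:", res.N, "total allocs:", res.MemAllocs, "allocs/op:", res.AllocsPerOp())
    }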

EDIT: Investigating the memory profile with pprof shows allocations for the go func() calls only.


Solution

  • Starting a goroutine allocates memory for its stack, so your go func accounts for two allocs per loop: one for the goroutine's stack and one for evaluating the argument. I would have to check the exact memory layout, but remember that this is a go statement applied to a call expression, so the function value and its parameters have to be evaluated first. One allocation goes away when you use the (global) wait group with func() {...}() instead of func(wg *sync.WaitGroup) {...}(&wg), as shown in the sketch at the end of this answer.

    When you have 1,000 goroutines fighting for a mutex, you'll run into lockSlow/unlockSlow, which is also not allocation-free. You can easily test that with:

        b.Run("mutex", func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                wg.Add(1)
                go func(wg *sync.WaitGroup) {
                    for i := 0; i < numCalls; i++ {
                        counterLock.Lock()
                        counter++
                        counterLock.Unlock()
                    }
                    wg.Done()
                }(&wg)
                wg.Wait()
            }
        })
    

    That would explain the three and two allocations per loop you are seeing.
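
    For reference, here is a sketch of the no-argument variant mentioned above: the goroutine simply closes over the package-level wg instead of receiving &wg as an argument, which removes the per-goroutine allocation for the argument:

        b.Run("mutex", func(b *testing.B) {
            wg.Add(b.N)
            for i := 0; i < b.N; i++ {
                // The closure captures the package-level wg directly, so no
                // argument has to be evaluated and allocated per goroutine.
                go func() {
                    for i := 0; i < numCalls; i++ {
                        counterLock.Lock()
                        counter++
                        counterLock.Unlock()
                    }
                    wg.Done()
                }()
            }
            wg.Wait()
        })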