
How do I measure performance of Lisp code?


I'm trying to work out how much of a performance gain I would get by converting Lisp (SBCL) code to assembly language. The time function in SBCL keeps giving counterintuitive results, even for trivial operations.

On the one hand, running this implementation

(defun test()
  (declaim (optimize (speed 3) (safety 0) (debug 0)))
  (setf my-array (make-array 10))
  (setf (aref my-array 2) 4)
  (time (progn 
          (aref my-array 1) 
          (aref my-array 3)))
)

reports roughly between 2,200 and 3,000 clock cycles, which in and of itself seems excessive for just a pair of straightforward lookups in a small array.

On the other hand, running this variant under similar conditions

(defun test()
  (declaim (optimize (speed 3) (safety 0) (debug 0)))
  (setf my-array (make-array 10))
  (setf (aref my-array 2) 4)
  (time (progn 
          (aref my-array 1) 
          (aref my-array 3)))
  (time (progn 
          (aref my-array 1) 
          (aref my-array 3))) ;just duplicated previous line
)

results in each call to time reporting between 1,200 and 2,000 clock cycles. So the behaviour is doubly odd: the cycle counts are still too high for trivial operations, and there is a noticeable difference between calling time once versus twice.

In both versions I placed time inside the function body to avoid the overhead of loading the function and preparing for execution. I'm also omitting outliers, which would otherwise make the results look even more volatile. The ranges vary between Linux (Ubuntu 22.04) and Windows, but the issue is essentially the same.

Obviously this method doesn't look promising for making decisions about much more complex code. Am I missing something, or is the built-in time function just unreliable for performance measurement?


Solution

  • The first rule of benchmarking is to measure code which is not hopeless. To do this you need both to write code you think is not hopeless and to listen to what the compiler tells you about it, to make sure it really is not (this is particularly important for SBCL).

    Here is a version of your test function which is not hopeless:

    (defun test ()
      (declare (optimize (speed 3) (safety 0) (debug 0)))
      (let ((my-array (make-array 10)))
        (setf (aref my-array 2) 4)
        (values (aref my-array 1)
                (aref my-array 3))))
    

    This is at least legal, and does not ignore the return values. The SBCL compiler will also compile it with no notes.

    Note that measuring the performance of this will measure the allocation of the array as well as the two, or three, array accesses. The function will also return completely undefined values, because you have not initialised the array.
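
    If you do want the array to have defined contents, you can initialise it when you create it, for instance:

    (make-array 10 :initial-element 0)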

    The second rule of benchmarking is to measure code that is in some way equivalent to the code you're comparing it with. In your case you are thinking about converting it to assembler, and you want to measure array references. OK, so unless you're planning on writing half of Common Lisp in assembler (there's a famous law...) this probably means you want to measure performance for arrays with fixed, known element types and arrays which are, in CL terms, simple. You also want to measure array reference performance, not array creation. At least, you wish to disentangle those things.
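
    For example, in SBCL you can make such an array and check that it really is simple and specialised like this (other implementations may upgrade fixnum to a wider element type, in which case the declarations below would need adjusting):

    (let ((a (make-array 10 :element-type 'fixnum :initial-element 0)))
      ;; In SBCL both of these hold; other implementations may upgrade
      ;; FIXNUM to a wider type.
      (values (typep a '(simple-array fixnum (*)))
              (array-element-type a)))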

    OK, so let's do that. I'll assume the array is a one-dimensional simple array of fixnums.

    (deftype array-index ()
      `(integer 0 (,array-dimension-limit)))
    
    (defun test-2 (a n m)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type (simple-array fixnum (*)) a)
               (type array-index n m))
      (values (aref a n)
              (aref a m)))
    

    (Don't worry about the array-index type definition: it just saves work in the type declarations for the function.)
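
    To make that concrete, it just describes valid array indices:

    (typep 5 'array-index)   ; => T
    (typep -1 'array-index)  ; => NIL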

    The zeroth rule of benchmarking is to measure many iterations of something. Here is a little macro which does that:

    (defun time-calls (f n report-every)
      (let ((start (get-internal-real-time)))
        (dotimes (i n)
          (when (and report-every (zerop (mod i report-every)))
            (format *debug-io* "~&~10D of ~D~%" (1+ i) n))
          (funcall f))
        (float (/ (- (get-internal-real-time) start)
                  internal-time-units-per-second
                  n))))
    
    (defmacro timing ((&optional (n 1) report-every) &body forms)
      "Time a number of iterations of some forms.
    
    The value of this form is the time elapsed per iteration, in seconds,
    as a single float.
    
    (timing () ...) times one iteration.  (timing (n) ...) times n
    iterations.  (timing (n m) ...) times n iterations, printing a progress
    report to *DEBUG-IO* every m iterations.
    
    The forms aren't compiled: you need to use this form in compiled code
    for that."
      `(time-calls (lambda ()
                     ,@forms)
                   ,n
                   ,report-every))
    

    So now we can write a wrapper function which will time test-2:

    (defun time-test-2 (iters)
      (let ((a (make-array 10 :element-type 'fixnum :initial-element 0)))
        (declare (type (simple-array fixnum (*)) a))
        ;; Time per reference
        (/ (timing (iters)
             (test-2 a 1 2))
           2)))
    

    OK, so we can now time this. You want to time a lot of iterations: for code like this on a modern machine a billion iterations is reasonable. You want it to take a few seconds to run.

    > (time-test-2 1000000000)
    4.318865e-9
    

    This is on an M1 clocked at (I think) 3.2 GHz, so this works out to about 14 clock cycles per reference.
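
    That number is just the measured time per reference multiplied by the assumed clock rate:

    (* 4.318865e-9 3.2e9)  ; => ~13.8, so about 14 cycles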

    Ah, but.

    The fourth rule of benchmarking is to be careful that you are measuring what you think you are measuring.

    Well, in this case we're not, quite: we're measuring at least the overhead of calling test-2 on every iteration, any bounds checking of the array accesses, and the array references themselves.

    Let's try to get rid of the first two: the first because it's just not what we want to measure, and the second because we want to compare things to assembler, which likely is not doing bounds checks.

    So we can write this:

    (defun test-3 (a n m)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type (simple-array fixnum (10)) a)
               (type fixnum n m))
      (let ((s 0))
        (declare (type fixnum s))
        ;; time per iteration so divide by iterations
        (values (/ (timing (m)
                     (dotimes (i n)
                       (incf s (aref a 0))))
                   n)
                ;; Return S as well to try and avoid it being optimized away
                s)))
    

    When compiling this you have to check the first rule, because it does generate a compiler note. But it's OK: the note is about the final division, which happens exactly once.

    For this:

    > (test-3 (make-array 10 :element-type 'fixnum :initial-element 1)
              10000
              1000000)
    9.602013e-10
    10000000000
    

    OK, this is about three clock cycles per reference.

    ... Except no, it's not: we're measuring the loop and increment as well. So we can measure just the loop:

    (defun test-4 (v n m)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type fixnum v n m))
      (let ((s 0))
        (declare (type fixnum s))
        ;; time per iteration so divide by iterations
        (values
         (/ (timing (m)
              (dotimes (i n)
                (incf s v)))
            n)
         s)))
    

    Measuring this was somewhat hard: I got very variable results, depending, I think, on whether a GC happened during the run. After cranking things up a bit I got

    > (test-4 0
              100000
              1000000)
    3.4967895e-10
    0
    

    So this is perhaps a little over 1 clock cycle per loop iteration, which implies an array reference is about 2 clock cycles in the simple case.
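
    If GC is indeed the source of the variability, one thing that can help in SBCL is to force a full collection just before the timing run, for instance:

    (sb-ext:gc :full t)  ; do a full GC now, so one is less likely to happen mid-measurement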

    Of course, you can also just check: if the code is simple enough you can just see what the compiler does. The code for the functions above is pretty big, not least because of timing and the implicit function it creates. But if you write a simple function:

    (defun test-5 (a n)
      (declare (optimize (speed 3) (safety 0))
               (type (simple-array fixnum (10)) a)
               (type array-index n))
      (aref a n))
    

    Then

    > (disassemble 'test-5)
    ; disassembly for test-5
    ; Size: 24 bytes. Origin: #x7008571648                        ; test-5
    ; 48:       49090B8B         add TMP, R0, R1, LSL #2
    ; 4C:       2A1140F8         ldr R0, [TMP, #1]
    ; 50:       FB031AAA         mov CSP, CFP
    ; 54:       BF0300F1         cmp NULL, #0
    ; 58:       5A7B40A9         ldp CFP, LR, [CFP]
    ; 5C:       C0035FD6         ret
    nil
    

    The add TMP, R0, R1, LSL #2 is obviously computing the index (the array is initially in R0 I think, and n is in R1), and then ldr R0, [TMP, #1] is smashing R0 with the array element. The rest of the function is just housekeeping. You can also see that with (safety 0) there are not, in fact, any bounds checks in this code.
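
    If you want to see what the bounds check would cost, you can compile an otherwise-identical function with safety turned back on and disassemble that as well; the resulting code should be noticeably longer (I won't reproduce it here, and test-5-safe is just a name for this sketch):

    (defun test-5-safe (a n)
      ;; same as TEST-5 but with SAFETY 1, so the bounds check is kept
      (declare (optimize (speed 3) (safety 1))
               (type (simple-array fixnum (10)) a)
               (type array-index n))
      (aref a n))

    (disassemble 'test-5-safe)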


    As a note, it's interesting to see how completely general, safe, one-dimensional array access performs. Given this:

    (defun test-general (a i1 i2 &key (repeat 1000) (tries 1000))
      (declare (optimize speed))
      (declare (type (array * (*)) a)
               (type array-index i1 i2)
               (type fixnum repeat tries))
      (setf (aref a i1) repeat)
      (values
       (/ (timing (tries)
            (dotimes (i repeat)
              (setf (aref a i2) (aref a i1))))
          repeat)
       (aref a i2)))
    

    Then there are copious compiler notes. Experimentally a read and write seem to take about 15 cycles. Declaring the array (array t (*)) reduces it to 7, and (simple-array t (*)) drops it to about 2. (simple-array * (*)) is just as bad as a completely general vector.
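
    For reference, here is roughly how those comparisons can be run (a sketch: the faster figures also require editing the array type declaration in test-general and recompiling, since it is fixed at (array * (*)) above):

    ;; A completely general, non-simple vector (adjustable arrays are not simple):
    (test-general (make-array 100 :adjustable t :initial-element 0) 3 4)

    ;; A simple general vector; to see the (array t (*)) and (simple-array t (*))
    ;; numbers, change the declaration in TEST-GENERAL accordingly and recompile.
    (test-general (make-array 100 :initial-element 0) 3 4)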