Tags: performance, clojure, channel, transducer

Understanding Clojure Transducer Performance


At a high level, I understood that a transducer does not create any intermediate data structures, whereas a long chain of operations via ->> does, and that the transducer approach is therefore more performant. This is borne out in one of my examples below. However, when I add clojure.core.async/chan to the mix, I do not get the performance improvement I expect. Clearly there is something I don't understand, and I would appreciate an explanation.
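
To make that intuition concrete before the full benchmark below, here is a minimal sketch of the difference (my own illustration, not part of the benchmark code):

;; Each threaded map builds another lazy sequence that has to be realized and
;; walked, while the composed transducer applies every step in a single pass
;; over the input with no intermediate sequences.
(->> (range 5) (map inc) (map inc))              ; => (2 3 4 5 6), one lazy seq per map step
(into [] (comp (map inc) (map inc)) (range 5))   ; => [2 3 4 5 6], one pass, no intermediates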

(ns dev
  (:require [clojure.core.async :as async]
            [criterium.core :as crit]))

;; Setup some toy data.
(def n 1e6)
(def data (repeat n "1"))


;; Reusable thread-last operation (the "slower" one).
(defn tx [x]
  (->> x
       (map #(Integer. %))
       (map inc) (map inc) (map inc) (map inc) (map inc) (map inc)
       (map inc) (map inc) (map inc) (map inc) (map inc)))

;; Reusable transducer (the "faster" one).
(def xf (comp
          (map #(Integer. %))
          (map inc) (map inc) (map inc) (map inc) (map inc) (map inc)
          (map inc) (map inc) (map inc) (map inc) (map inc)))

;; For these first two I expect the second to be faster and it is.
(defn nested []
  (last (tx data)))

(defn into-xf []
  (last (into [] xf data)))

;; For the next two I again expect the second to be faster but it is NOT.
(defn chan-then-nested []
  (let [c (async/chan n)]
    (async/onto-chan! c data)
    (->> c
         (async/into [])
         async/<!!
         tx
         last)))

(defn chan-xf []
  (let [c (async/chan n xf)]
    (async/onto-chan! c data)
    (->> c
         (async/into [])
         async/<!!
         last)))

(comment

  (crit/quick-bench (nested)) ; 1787.672 ms
  (crit/quick-bench (into-xf)) ; 822.8626 ms
  (crit/quick-bench (chan-then-nested)) ; 1535.628 ms
  (crit/quick-bench (chan-xf)) ; 2072.626 ms

  ;; Expected ranking fastest to slowest
  ;; into-xf
  ;; nested
  ;; chan-xf
  ;; chan-then-nested

  ;; Actual ranking fastest to slowest
  ;; into-xf
  ;; chan-then-nested
  ;; nested
  ;; chan-xf

  )

In the end there are two results I don't understand.

First, why is using a transducer with a channel slower than reading everything off the channel and then doing the nested maps? It appears that the overhead, whatever it is, of using a transducer with a channel is so large that it overwhelms the gains of not creating intermediate data structures.

Second, and this one was really unexpected, why is it faster to put the data onto a channel, take it all off again, and then use the nested-map technique than it is to skip the channel dance entirely and just use the nested-map technique? (Put more briefly, why is chan-then-nested faster than nested?)

Could some or all of this just be an artifact of the benchmarking, or randomness? (I have run quick-bench several times for each of these with the same results.) I'm wondering if it has something to do with into calling transduce, whereas the channel version is not implemented the same way at all. The transducer provides the same interface for applying a transformation across vectors or channels, but how that transformation is applied differs, and that difference makes all the difference.
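
As a rough illustration of that last point (my own sketch, not from the original post): into applies the transducer via one eager reduce over the source collection, while a channel applies the same transducer to each element as it is added to the channel's buffer, so every element also pays for the channel's put/take machinery. A printing map step makes it visible where the work happens in each case:

;; A small side-effecting transducer so we can see when the transformation runs.
(def tap-xf (map (fn [x] (println "transforming" x) (inc x))))

;; into: the transducer runs inside a single eager reduce over the vector.
(into [] tap-xf [1 2 3])
;; => [2 3 4]

;; chan: the same transducer runs once per element as each value lands in the
;; channel's buffer; the results then still have to be taken off the channel.
(let [c (async/chan 10 tap-xf)]
  (async/onto-chan! c [1 2 3])
  (async/<!! (async/into [] c)))
;; => [2 3 4]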


Solution

  • Some remarks on your methodology:

    After making those changes, here is the benchmark data I see:

    nested:
    Evaluation count : 14688 in 6 samples of 2448 calls.
                 Execution time mean : 39.978735 µs
        Execution time std-deviation : 1.238587 µs
       Execution time lower quantile : 38.870558 µs ( 2.5%)
       Execution time upper quantile : 41.779784 µs (97.5%)
                       Overhead used : 10.162171 ns

    into-xf:
    Evaluation count : 20094 in 6 samples of 3349 calls.
                 Execution time mean : 30.557295 µs
        Execution time std-deviation : 562.641738 ns
       Execution time lower quantile : 29.936152 µs ( 2.5%)
       Execution time upper quantile : 31.330094 µs (97.5%)
                       Overhead used : 10.162171 ns

    chan-then-nested:
    Evaluation count : 762 in 6 samples of 127 calls.
                 Execution time mean : 740.642963 µs
        Execution time std-deviation : 176.879454 µs
       Execution time lower quantile : 515.588780 µs ( 2.5%)
       Execution time upper quantile : 949.109898 µs (97.5%)
                       Overhead used : 10.162171 ns
    
    Found 2 outliers in 6 samples (33.3333 %)
        low-severe   1 (16.6667 %)
        low-mild     1 (16.6667 %)
     Variance from outliers : 64.6374 % Variance is severely inflated by outliers

    chan-xf:
    Evaluation count : 816 in 6 samples of 136 calls.
                 Execution time mean : 748.782942 µs
        Execution time std-deviation : 7.157018 µs
       Execution time lower quantile : 740.139618 µs ( 2.5%)
       Execution time upper quantile : 756.102312 µs (97.5%)
                       Overhead used : 10.162171 ns
    

    The key takeaways are:

    1. Async overhead dominates actual processing time. Both of the channel versions are much slower than either non-channel version, so we no longer have to worry about "why is it faster to put the whole thing onto a channel and then take it off again".
    2. The difference between chan-then-nested and chan-xf is much smaller than in your version. chan-xf is still a tiny bit slower, but easily within one standard deviation: not really a remarkable result.
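
    One way to convince yourself of point 1 (a sketch of mine, not part of the original answer) is to time the channel round trip with no transformation at all; if that alone lands in the same range as chan-then-nested and chan-xf, then the per-element put/take cost, not the transducer, is what dominates:

    ;; Hypothetical helper, not in the original code: the same channel round
    ;; trip, but with no transducer and no map chain, isolating the async cost.
    (defn chan-only []
      (let [c (async/chan n)]
        (async/onto-chan! c data)
        (->> c
             (async/into [])
             async/<!!
             last)))

    (comment
      (crit/quick-bench (chan-only)))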