Tags: csv, clojure

Slow parsing of a CSV file into a map of vectors


I am trying to read in a CSV file and parse it into a map of vectors. So, the keys of the map are the column names from the CSV and the values of the map are vectors containing the columns of values from the CSV.
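
For example, with made-up data, a file like

    name,age
    ann,10
    bob,12

should parse to:

    {:name ["ann" "bob"]
     :age  ["10" "12"]}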

I use clojure.data.csv (the Clojure Contrib CSV library) for reading the file, and even though the CSV file (found here) is only 32 MB, my code runs remarkably slowly.

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn csv->df [file-path]
  (with-open [reader (io/reader file-path)]
    (let [in-file (csv/read-csv reader) ; lazy seq of row vectors
          names (first in-file)         ; header row
          data (rest in-file)]          ; remaining data rows
      ;; transpose rows into columns, then key each column by its header
      (zipmap (map keyword names) (apply mapv vector data)))))

(csv->df "data/flights.csv")

I suspect I'm doing something daft related to lazy sequences since I'm still getting to grips with them as a Clojure newb, but I'm unable to identify the root cause of the issue.

Is it possible to restructure this function so that it doesn't run at a glacial pace?


Solution

  • You're not benefiting from laziness, because transposing a matrix (which is what apply mapv vector is doing) can't be lazy. But you're not doing anything especially wrong. In my testing on my machine, 9 seconds is how long it takes just to call read-csv on this file and iterate over the results. Your function takes about twice as long. So your processing isn't exactly fast, but the best you could hope for is a 50% speedup for the whole process, even if you somehow managed to do all your post-processing in zero time.
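
    To see why the transpose forces the whole file into memory: (apply mapv vector rows) can't emit its first output row until it has seen the first element of every input row. A quick REPL illustration with made-up data:

    ;; transposition: rows in, columns out
    (apply mapv vector [["a" 1] ["b" 2] ["c" 3]])
    ;; => [["a" "b" "c"] [1 2 3]]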

    I think these costs are just inevitable if you use clojure.data.csv. It's very convenient to have everything packaged up into a tidy Clojure data structure for you, but that doesn't come for free.

    I tried FastCSV for comparison: it's around 20 times faster to read-and-discard the whole file, and similarly fast to produce a vector-of-vectors from the file. The transposition is still slow, but the whole process clocks in at 5s instead of the 17s for your function. Here's what I wrote: inflexible and clunky, but it produces the same result and is simple enough. If performance really matters, you can fuse the transposition step: never build a vector of rows to begin with, but instead maintain a separate vector for each column and update each one for every row you encounter (a sketch of that follows the code below).

    (import '[de.siegmar.fastcsv.reader CsvReader CsvRow])
    
    ;; FastCSV's Spliterator hands rows to java.util.function.Consumer
    ;; callbacks; this macro builds one, type-hinting the row argument
    ;; as CsvRow to avoid reflection.
    (defmacro row-fn [[arg] & body]
      (let [x (gensym 'arg)]
        `(reify java.util.function.Consumer
           (accept [this ~x]
             (let [~(with-meta arg {:tag `CsvRow}) ~x]
               ~@body)))))
    
    (defn fastcsv->df [file-path]
      (let [headers (promise)
            rows (atom (transient [])) ; single-threaded here, so a transient inside an atom is OK
            csv (-> (CsvReader/builder)
                    (.build (io/reader file-path))
                    (.spliterator))]
        ;; the first row delivers the column names...
        (.tryAdvance csv (row-fn [row]
                           (deliver headers (map keyword (.getFields row)))))
        ;; ...and every remaining row is collected as-is
        (.forEachRemaining csv (row-fn [row]
                                 (swap! rows conj! (.getFields row))))
        (zipmap @headers (apply map vector (persistent! @rows)))))
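
    Call it the same way as before: (fastcsv->df "data/flights.csv").

    And, for completeness, a rough sketch of the fused-transposition idea described above: keep one transient vector per column and push each field onto its column as the rows stream past, so no row matrix is ever built or transposed. The name fastcsv->df-fused is mine and the code is untuned; treat it as a starting point rather than a finished implementation.

    (defn fastcsv->df-fused [file-path]
      (let [headers (promise)
            cols (volatile! nil) ; will hold one transient vector per column
            csv (-> (CsvReader/builder)
                    (.build (io/reader file-path))
                    (.spliterator))]
        ;; header row: record the names and make one empty transient per column
        (.tryAdvance csv (row-fn [row]
                           (let [names (.getFields row)]
                             (deliver headers (mapv keyword names))
                             (vreset! cols (mapv (fn [_] (transient [])) names)))))
        ;; data rows: append each field onto its column's transient
        (.forEachRemaining csv (row-fn [row]
                                 (vswap! cols #(mapv conj! % (.getFields row)))))
        (zipmap @headers (mapv persistent! @cols))))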