mongodbclojuremongerclj-time

Is there an efficient way to group records by day according to a specific time zone?


I’m using a MongoDB database to keep track of analytics for an application. I’m writing a Clojure application (using clj-time and Monger) to get data out of the database.

I have a collection containing records like

{"_id": ObjectId(...),
 timestamp: ISODate("2013-06-01T15:18:37Z"),
 device: "04dbf04b6dc0d0a4fd383967b3dc62f50111e07e"}

Each different device represents a different user of my service. What I’d like to do is to find out how many (unique) users I have each day, but with the caveat that I’d like “day” to refer specifically to the US/Central time zone, taking daylight saving into account. (If that weren’t a requirement I think I could just do something like a $group and then a distinct.)

Here’s what I’ve been doing:

(ns analytics.reporting
  (:use [monger.core :only [connect! connect set-db! get-db]]
        monger.operators
        clj-time.core
        clj-time.periodic
        clj-time.format)
  (:require [monger.collection :as mc]))

(defn to-central
  [dt]
  (from-time-zone dt (time-zone-for-id "America/Chicago")))

(defn count-distinct
  [coll]
  (count (distinct coll)))

(defn daily-usage
  [ndays]
  (let [midnights (map to-central
                       (reverse (for [offset (map days (range ndays))]
                                  (minus (to-central (today-at 0 0)) offset))))
        by-day (for [midnight midnights]
                 (mc/find-maps "devices" {:timestamp {$gte midnight $lt (plus midnight (days 1))}}))
        devices-by-day (map #(map :device %) by-day)
        distinct-devices-by-day (map count-distinct devices-by-day)]
    distinct-devices-by-day))

If you can’t read Clojure, this basically says: get a list of the most recent n midnights in the Central time zone, and then run Mongo queries to find all of the records between each successive pair of midnights. Then, count the number of distinct devices within each day.

Here’s what I don’t like about this approach:

  1. Running a separate query for each day (I usually look at 30 days at a time) feels wrong; this is something that should be done on the database side instead of the application side.
  2. Counting the distinct devices should also be done by the database.
  3. My server is set to the UTC time zone, so if it’s after midnight in UTC but before midnight in Central time, the last entry in this list will always be zero. This is easy enough to patch over, but I’d prefer a solution smart enough to prevent it in the first place.
  4. This entire function takes about 500ms to run. That isn’t awful—I’m the only one who runs the query, and only once or twice per day—but it seems like the operation shouldn’t take that long.

Is there a way I can shove more of this logic into the MongoDB query?


Solution

  • As suggested by @WiredPrairie, I ended up just including the Central-time date in each record as I added it to the database. Then I was able to use a trivial $group query to gather the number of records for each date.