mongodbfinancial

How to aggregate OHLC bars with mongodb?


I have 1 min OHLC bars on some market data that I imported into Mongo DB.

Each document looks like this:

{
    "_id" : ObjectId("5ac3163f31a0632c7642ca1c"),
    "Date" : "08/06/2007",
    "Time" : "15:01",
    "Open" : 1310,
    "High" : 1310.25,
    "Low" : 1309.5,
    "Close" : 1310,
    "Up" : 209,
    "Down" : 165,
    "Volume" : 0
}

I want to build a function that allows me to quickly generate X-bar intervals from this data. i.e. generate output 5-min bars, 1 hr bars, daily bars, etc... I also want to be able to filter out a data range.

I've been playing around with Mongo's aggregation functions, but I'm getting overwhelmed how I should approach this AND how I should order the pipeline operations.

Do I first group by 'Date', then sort by 'Time', then group again by $first, $last, $max and $min?

Or do I first create a new field somehow combining 'Date' and 'Time' and then proceed to the grouping?

Although don't I need to first somehow convert the "Date" and "Time" fields from string to Date field so that Mongo knows how to sort and match properly? ...but then which order would I do that in?

I'm still a newbie to MongoDB, so any advice would be appreciated.


Solution

  • Ok, I've come up with a solution:

    db.minbars.aggregate([
       {
          $project: 
          {
             dts: 
             {
                $dateFromString: 
                {
                   dateString: 
                   {
                      $concat: ['$Date', '$Time']
                   }
                }
             },
             Open:1, 
             High:1, 
             Low:1, 
             Close:1
          }   
       },
       {
          $match: 
          {
             dts: 
             { 
                $gte: ISODate("2016-01-01T00:00:00.000Z"), 
                $lte: ISODate("2016-12-31T00:00:00.000Z")
             }
          }
       },
       {
          $sort: { dts : 1 }
       },
       {
          $group:
          {
             _id: 
             {
                year: {$year: "$dts"},
                month: {$month: "$dts"},
                day: {$dayOfMonth: "$dts"},               
                hour: {$hour: "$dts"},
                min: 
                {
                    $add: 
                    [
                       {$subtract:
                       [
                          {$minute: "$dts"},
                          {$mod: [{$minute: "$dts"}, 5]}
                       ]},
                       5   
                    ]
                }   
             },
             Open: {$first: "$Open"},
             High: {$max: "$High"},
             Low: {$min: "$Low"},
             Close: {$last: "$Close"}
          }
       } 
    ], {allowDiskUse: true})
    

    Here's an explanation for each pipeline stage:

    1. Project

    Use 'dateFromString' from combine 'Date' and 'Time' into an ISODate object ('dts' - which stands for date timestamp). Preserve the other OHLC fields.

    1. Match

    Filter out based on a date range

    1. Sort

    Sort by the new ISODate object ('dts').

    1. Group

    Group together all those documents with the same Year, Month, Day, Hour, and 5-minute minute interval. The minute interval uses the formula: minute = minuteIn - (minuteIn % i) + i, where i=minute interval. I'm adding 'i' so that minutes 00, 01, 02, 03, and 04 are aggregated to the next 05 minute interval (and not the preceding 00 minute interval). NOTE: if you want 1-hr, 4-hr, Daily bars, etc...then you need to adjust the _id section accordingly.

    NOTE: I'm using {allowDiskUse: true} here because at one point I ran into the Memory constraints at the Sort stage.

    Maybe someone can come up with a simpler way to do this?


    UPDATE:

    As I noted in 4) above, I mentioned I was adding "i" (minute interval) to the resulting minute. However, when I did this I ended up with a '60' minute interval showing up in the output. You should only have 0, 5, 10, 15, ...55 minute bars and there should NOT be a 60-minute bar. So this was NOT correct.

    Also, if you compare against a trading platform (i.e. Thinkorswim) you can see that the standard practice is to use preceding 5-minute interval for the timestamp of the bar. For example, the 5-minute bar 9:25 represents the aggregation of these minute bars: 9:25, 9:26, 9:27, 9:28, 9:29.