sqlpostgresqlwindow-functionsaggregate-filter

Compute a running ratio of two totals


I have a PostgreSQL 9.4.1 database (Retrosheet data) with a table events containing one row per baseball play. I want to a compute a running batting average for a given player: the formula is (total number of hits so far)/(total number of valid at-bats so far).

I can use window functions to get a running total of hits for David Ortiz, whose player code is ortid001, using the following query:

SELECT count(*) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
FROM events WHERE bat_id='ortid001' AND (event_cd='20' OR event_cd='21' 
OR event_cd='22' OR event_cd='23');

(The clause involving event_cd just identifies which rows are considered hits.)

Using the same technique, I can get a running total of at-bats (the event_cd clause rejects every row that doesn't count as an at-bat. note that the hits selected above are a subset of at-bats):

SELECT count(*) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
FROM events WHERE bat_id='ortid001' AND (event_cd != '11' AND 
event_cd!='14' AND event_cd!='15' AND event_cd!='16' AND     
event_cd!='17');

How can I combine these? Ideally, for each row describing a play with bat_id='some_player_id', I would compute two functions: the count of all preceding rows describing an at-bat, and the count of all preceding rows describing hits. Dividing these gives the running batting average at that row.


Solution

  • Assuming (since it hasn't been declared) event_cd is data type integer and can be NULL.

    SELECT *, round(hit::numeric / at_bat, 2) AS rate
    FROM  (
       SELECT input_ts
            , count(*) FILTER (WHERE event_cd = ANY ('{20,21,22,23}'::int[]))
                       OVER (ORDER BY input_ts) AS hit
            , count(*) FILTER (WHERE NOT (event_cd = ANY ('{11,14,15,16,17}'::int[]))) 
                       OVER (ORDER BY input_ts) AS at_bat
       FROM   events
       WHERE  bat_id = 'ortid001'
       ) sub
    ORDER  BY input_ts;
    

    Since you are using pg 9.4, you can use the new aggregate FILTER clause. Related answer:

    The frame definition ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default, so you don't have to declare it.

    But there is no "natural order" in a database table. Don't confuse this with a spreadsheet. You need to define it with ORDER BY. I am using the imaginary column input_ts, replace it with a (list of) column(s) that defines your sort order. More:

    I avoid NOT IN, because it exhibits tricky behavior with NULL values.

    The cast to numeric is to avoid integer division which would cut off fractional digits and lead to questionable usefulness. Rounding the result to two fractional digits.