I have a PostgreSQL 9.4.1 database (Retrosheet data) with a table events
containing one row per baseball play. I want to a compute a running batting average for a given player: the formula is (total number of hits so far)/(total number of valid at-bats so far).
I can use window functions to get a running total of hits for David Ortiz, whose player code is ortid001
, using the following query:
SELECT count(*) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM events WHERE bat_id='ortid001' AND (event_cd='20' OR event_cd='21'
OR event_cd='22' OR event_cd='23');
(The clause involving event_cd
just identifies which rows are considered hits.)
Using the same technique, I can get a running total of at-bats (the event_cd
clause rejects every row that doesn't count as an at-bat. note that the hits selected above are a subset of at-bats):
SELECT count(*) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM events WHERE bat_id='ortid001' AND (event_cd != '11' AND
event_cd!='14' AND event_cd!='15' AND event_cd!='16' AND
event_cd!='17');
How can I combine these? Ideally, for each row describing a play with bat_id='some_player_id'
, I would compute two functions: the count of all preceding rows describing an at-bat, and the count of all preceding rows describing hits. Dividing these gives the running batting average at that row.
Assuming (since it hasn't been declared) event_cd
is data type integer
and can be NULL.
SELECT *, round(hit::numeric / at_bat, 2) AS rate
FROM (
SELECT input_ts
, count(*) FILTER (WHERE event_cd = ANY ('{20,21,22,23}'::int[]))
OVER (ORDER BY input_ts) AS hit
, count(*) FILTER (WHERE NOT (event_cd = ANY ('{11,14,15,16,17}'::int[])))
OVER (ORDER BY input_ts) AS at_bat
FROM events
WHERE bat_id = 'ortid001'
) sub
ORDER BY input_ts;
Since you are using pg 9.4, you can use the new aggregate FILTER
clause. Related answer:
The frame definition ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
is the default, so you don't have to declare it.
But there is no "natural order" in a database table. Don't confuse this with a spreadsheet. You need to define it with ORDER BY
. I am using the imaginary column input_ts
, replace it with a (list of) column(s) that defines your sort order. More:
I avoid NOT IN
, because it exhibits tricky behavior with NULL values.
The cast to numeric
is to avoid integer division which would cut off fractional digits and lead to questionable usefulness. Rounding the result to two fractional digits.