pythondatabasepostgresqlleft-joingenerate-series

How to return rows with 0 count for missing data?


I have unevenly distributed data (wrt date) for a few years (2003-2008). I want to query data for a given set of start and end date, grouping the data by any of the supported intervals (day, week, month, quarter, year) in PostgreSQL 8.3.

The problem is that some of the queries give results continuous over the required period, as this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'), count(distinct post_id) 
from some_table
where category_id = 1
and entity_id = 77
and entity2_id = 115 
and date <= '2008-12-06'
and date >= '2007-12-01'
group by date_trunc('month',date)
order by date_trunc('month',date);

  to_char   | count 
------------+-------
 2007-12-01 |    64
 2008-01-01 |    31
 2008-02-01 |    14
 2008-03-01 |    21
 2008-04-01 |    28
 2008-05-01 |    44
 2008-06-01 |   100
 2008-07-01 |    72
 2008-08-01 |    91
 2008-09-01 |    92
 2008-10-01 |    79
 2008-11-01 |    65
(12 rows)

But some of them miss some intervals because there is no data present, as this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'), count(distinct post_id) 
from some_table
where category_id=1
and entity_id = 75
and entity2_id = 115 
and date <= '2008-12-06'
and date >= '2007-12-01'
group by date_trunc('month',date)
order by date_trunc('month',date);
    
  to_char   | count 
------------+-------

 2007-12-01 |     2
 2008-01-01 |     2
 2008-03-01 |     1
 2008-04-01 |     2
 2008-06-01 |     1
 2008-08-01 |     3
 2008-10-01 |     2
(7 rows)

where the required resultset is:

  to_char   | count 
------------+-------
 2007-12-01 |     2
 2008-01-01 |     2
 2008-02-01 |     0
 2008-03-01 |     1
 2008-04-01 |     2
 2008-05-01 |     0
 2008-06-01 |     1
 2008-07-01 |     0
 2008-08-01 |     3
 2008-09-01 |     0
 2008-10-01 |     2
 2008-11-01 |     0
(12 rows)

A count of 0 for missing entries.

I have seen earlier discussions on Stack Overflow but they don't solve my problem it seems, since my grouping period is one of (day, week, month, quarter, year) and decided on runtime by the application. So an approach like left join with a calendar table or sequence table will not help I guess.

My current solution to this is to fill in these gaps in Python (in a Turbogears App) using the calendar module.

Is there a better way to do this?


Solution

  • You can create the list of all first days of the last year (say) with

    select distinct date_trunc('month', (current_date - offs)) as date 
    from generate_series(0,365,28) as offs;
              date
    ------------------------
     2007-12-01 00:00:00+01
     2008-01-01 00:00:00+01
     2008-02-01 00:00:00+01
     2008-03-01 00:00:00+01
     2008-04-01 00:00:00+02
     2008-05-01 00:00:00+02
     2008-06-01 00:00:00+02
     2008-07-01 00:00:00+02
     2008-08-01 00:00:00+02
     2008-09-01 00:00:00+02
     2008-10-01 00:00:00+02
     2008-11-01 00:00:00+01
     2008-12-01 00:00:00+01
    

    Then you can join with that series.