sqlmonetdbmonetdblite

how to join tables on cases where none of function(a) in b


Say in MonetDB (specifically, the embedded version from the "MonetDBLite" R package) I have a table "events" containing entity ID codes and event start and end dates, of the format:

| id  | start_date  | end_date   |
| 1   | 2010-01-01  | 2010-03-30 |
| 1   | 2010-04-01  | 2010-06-30 |
| 2   | 2018-04-01  | 2018-06-30 |
| ... | ...         | ...        |

The table is approximately 80 million rows of events, attributable to approximately 2.5 million unique entities (ID values). The dates appear to align nicely with calendar quarters, but I haven't thoroughly checked them so assume they can be arbitrary. However, I have at least sense-checked them for end_date > start_date.

I want to produce a table "nonevent_qtrs" listing calendar quarters where an ID has no event recorded, e.g.:

| id  | last_doq   |
| 1   | 2010-09-30 |
| 1   | 2010-12-31 |
| ... | ...        |
| 1   | 2018-06-30 |
| 2   | 2010-03-30 |
| ... | ...        |

(doq = day of quarter)

If the extent of an event spans any days of the quarter (including the first and last dates), then I wish for it to count as having occurred in that quarter.

To help with this, I have produced a "calendar table"; a table of quarters "qtrs", covering the entire span of dates present in "events", and of the format:

| first_doq  | last_doq   |
| 2010-01-01 | 2010-03-30 |
| 2010-04-01 | 2010-06-30 |
| ...        | ...        |

And tried using a non-equi merge like so:

create table nonevents
as select
    id,
    last_doq
from
    events
    full outer join
    qtrs
on
    start_date > last_doq or
    end_date < first_doq
group by
    id,
    last_doq

But this is a) terribly inefficient and b) certainly wrong, since most IDs are listed as being non-eventful for all quarters.

How can I produce the table "nonevent_qtrs" I described, which contains a list of quarters for which each ID had no events?

If it's relevant, the ultimate use-case is to calculate runs of non-events to look at time-till-event analysis and prediction. Feels like run length encoding will be required. If there's a more direct approach than what I've described above then I'm all ears. The only reason I'm focusing on non-event runs to begin with is to try to limit the size of the cross-product. I've also considered producing something like:

| id  | last_doq   | event |
| 1   | 2010-01-31 | 1     |
| ... | ...        | ...   |
| 1   | 2018-06-30 | 0     |
| ... | ...        | ...   |

But although more useful this may not be feasible due to the size of the data involved. A wide format:

| id  | 2010-01-31 | ... | 2018-06-30 |
| 1   | 1          | ... | 0          |
| 2   | 0          | ... | 1          |
| ... | ...        | ... | ...        |

would also be handy, but since MonetDB is column-store I'm not sure whether this is more or less efficient.


Solution

  • Let me assume that you have a table of quarters, with the start date of a quarter and the end date. You really need this if you want the quarters that don't exist. After all, how far back in time or forward in time do you want to go?

    Then, you can generate all id/quarter combinations and filter out the ones that exist:

    select i.id, q.*
    from (select distinct id from events) i cross join
         quarters q left join
         events e
         on e.id = i.id and
            e.start_date <= q.quarter_end and
            e.end_date >= q.quarter_start
    where e.id is null;