postgresqllateral

Postgresql - LEFT JOIN LATERAL is too slow than subquery


I have some difficulty using 'LEFT JOIN LATERAL' function with postgresql 9.5.

In my table, there are three columns of 'ID', 'DATE', 'CODE'. One person (ID) have multiple rows as below. The number of ID is 362 and total row number is about 2500000.

ID   /  DATE     / CODE
1    /  20020101 / drugA
1    /  20020102 / drugA
1    /  20020103 / drugB
1    /  20020104 / drugA
1    /  20020105 / drugA
1    /  20020106 / drugB
1    /  20020107 / drugA
2    /  ...      / ...

I need to summarize information of drug A used between the first day and last day of drugB.

In the above case, only two rows should be remained for ID (1) [between 20020103 ~ 20020106; the period of drugB].

1    /  20020104 / drugA
1    /  20020105 / drugA

To take this job, I write SQL code using 'LEFT LATERAL JOIN' as below.

SELECT * FROM (SELECT ID, min(DATE) as start_date, max(DATE) as end_date from MAIN_TABLE WHERE CODE = 'drugA' GROUP BY ID) AA
LEFT JOIN LATERAL (SELECT ID, COUNT(ID) as no_tx, min(DATE) as fday_tx, max(DATE) lday_tx from MAIN_TABLE WHERE CODE = 'drugB' AND DATE > AA.start_date AND DATE < AA.end_date GROUP BY ID) as BB USING(ID);

There are only 362 person ID, but this postgresql code take about 2 mins.

It's too slow. Therefore, I tried another SQL code using subquery.

SELECT * FROM (SELECT ID, min(DATE) as start_date, max(DATE) as end_date from MAIN_TABLE WHERE CODE ='drugA' GROUP BY ID) AA
LEFT JOIN (
       SELECT ID, COUNT(ID) as no_tx, min(DATE) as fday_tx, max(DATE) lday_tx FROM (SELECT ID, DATE, CODE FROM MAIN_TABLE) BB
            LEFT JOIN (SELECT ID, min(DATE) as start_date, max(DATE) as end_date from MAIN_TABLE WHERE CODE ='drugA' GROUP BY ID) CC USING (ID)
            WHERE CODE = 'drugB' and DATE > start_date and DATE < end_date GROUP BY ID
            ) DD USING (ID);

This code is not simple but very fast (take only 1.6 sec).

When I compare the explain of two codes, the second code use hash join, but the first code do not.

Can I get some hint to improve the first code with 'LEFT LATERAL JOIN' function more efficiently?


Solution

  • Why not just use a join and group by?

    SELECT AA.ID, COUNT(B.ID) as no_tx, min(B.DATE) as fday_tx, max(B.DATE) as lday_tx,
           AA.start_date, AA.end_date
    FROM (SELECT ID, min(DATE) as start_date, max(DATE) as end_date 
          FROM MAIN_TABLE
          WHERE CODE = 'drugA'
          GROUP BY ID
         ) AA LEFT JOIN
         MAIN_TABLE b
         ON b.CODE = 'drugB' AND b.DATE > AA.start_date AND b.DATE < AA.end_date
    GROUP BY AA.ID,  AA.start_date, AA.end_date;
    

    Or, perhaps more efficiently, window functions:

    SELECT ID, SUM(CASE WHEN code = 'drugB' THEN 1 ELSE 0 END) as no_tx,
           MIN(CASE WHEN code = 'drugB' THEN DATE END) as fday_tx,
           MIN(CASE WHEN code = 'drugB' THEN DATE END) as lday_tx,
           start_date, end_date
    FROM (SELECT t.*,
                 MIN(CASE WHEN code = 'drugA' THEN date END) as start_date,
                 MAX(CASE WHEN code = 'drugB' THEN date END) as end_date
          FROM MAIN_TABLE t
         ) t
    WHERE code in ('drugA', 'drugB') AND
          date between start_date and end_date
    GROUP BY t.id;