sqlsqlite

Multi-column subquery in select part or other way to pre-filter a heavy aggregate


I have 2 tables related in a one-to-many relationship:

CREATE TABLE main 
(
    id INTEGER,
    filter_c TEXT
);

INSERT INTO main (id, filter_c) 
VALUES (1, 'data1'),
       (2, 'data2');
 
CREATE TABLE feature 
(
    main_id INTEGER,
    mark_c TEXT
);

INSERT INTO feature (main_id, mark_c) 
VALUES (1, 'mark1'),
       (1, 'mark1'),
       (1, 'mark2'),
       (1, null);

Indexes are created. Table sizes are 7k and 2m records. Expected growth by 3-4 times.

Statistics aggregate:

SELECT 
    main_id, SUM(amount) AS total, 
    mark_c AS best, MAX(marked) AS goods, 
    SUM(marked) - MAX(marked) AS bads
FROM 
    (SELECT 
         main_id, mark_c,
         COUNT(*) AS amount, --include Null
         COUNT(mark_c) AS marked --exclude Null
     FROM 
         feature
     GROUP BY 
         main_id, mark_c)
GROUP BY 
    main_id

The execution time for the whole table is ~7 seconds. But this is almost never needed. Up to 10 records will be filtered from main, for which statistics are needed. The expected way of obtaining it is...

SELECT 
    id, filter_c,
    (SELECT 
         main_id, 
         SUM(amount) AS total, mark_c AS best, 
         MAX(marked) AS goods, SUM(marked) - MAX(marked) AS bads
     FROM  
         (SELECT 
              main_id, mark_c,
              COUNT(*) AS amount, --include Null
              COUNT(mark_c) AS marked --exclude Null
          FROM 
              feature
          WHERE 
              main_id = id --PRE-FILTER
          GROUP BY 
              main_id, mark_c)
     GROUP BY 
         main_id)
FROM 
    main
WHERE 
    filter_c = 'data1'    

But you can't return multiple fields in such a subquery.

What other ways are there to collect statistics only for the records you need?


Solution

  • You can do the filtering inside the subquery. Add a WHERE to the inner query to restrict main_id to records that have the filter_c value you need:

        (SELECT 
             main_id, mark_c,
             COUNT(*) AS amount, --include Null
             COUNT(mark_c) AS marked --exclude Null
         FROM 
             feature
         WHERE
             main_id IN (SELECT id FROM main WHERE filter_c = 'data1') -- <- this line
         GROUP BY 
             main_id, mark_c)