I have 2 tables related in a one-to-many relationship:
CREATE TABLE main
(
id INTEGER,
filter_c TEXT
);
INSERT INTO main (id, filter_c)
VALUES (1, 'data1'),
(2, 'data2');
CREATE TABLE feature
(
main_id INTEGER,
mark_c TEXT
);
INSERT INTO feature (main_id, mark_c)
VALUES (1, 'mark1'),
(1, 'mark1'),
(1, 'mark2'),
(1, null);
Indexes are created. Table sizes are 7k and 2m records. Expected growth by 3-4 times.
Statistics aggregate:
SELECT
main_id, SUM(amount) AS total,
mark_c AS best, MAX(marked) AS goods,
SUM(marked) - MAX(marked) AS bads
FROM
(SELECT
main_id, mark_c,
COUNT(*) AS amount, --include Null
COUNT(mark_c) AS marked --exclude Null
FROM
feature
GROUP BY
main_id, mark_c)
GROUP BY
main_id
The execution time for the whole table is ~7 seconds. But this is almost never needed. Up to 10 records will be filtered from main
, for which statistics are needed. The expected way of obtaining it is...
SELECT
id, filter_c,
(SELECT
main_id,
SUM(amount) AS total, mark_c AS best,
MAX(marked) AS goods, SUM(marked) - MAX(marked) AS bads
FROM
(SELECT
main_id, mark_c,
COUNT(*) AS amount, --include Null
COUNT(mark_c) AS marked --exclude Null
FROM
feature
WHERE
main_id = id --PRE-FILTER
GROUP BY
main_id, mark_c)
GROUP BY
main_id)
FROM
main
WHERE
filter_c = 'data1'
But you can't return multiple fields in such a subquery.
What other ways are there to collect statistics only for the records you need?
You can do the filtering inside the subquery. Add a WHERE
to the inner query to restrict main_id
to records that have the filter_c
value you need:
(SELECT
main_id, mark_c,
COUNT(*) AS amount, --include Null
COUNT(mark_c) AS marked --exclude Null
FROM
feature
WHERE
main_id IN (SELECT id FROM main WHERE filter_c = 'data1') -- <- this line
GROUP BY
main_id, mark_c)