I would like to know whether multiple similar exists conditions can be combined in a meaningful and performant way.
Let us assume the following example: Different activities can be assigned to a service (n-m). Activities can be grouped independently into activity groups. Activity groups can be assigned to a group type.
If I now want to find all services that have a reference to certain group types and I want to link the condition by OR
, then this is relatively simple by combining EXISTS
and IN
.
select *
from service
where exists (
select 1
from activitiy
join activitiy_activitiy_group
on activitiy.id = activitiy_activitiy_group.id_activitiy
join activitiy_group
on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
where (
activitiy_group.id_type in (1, 3)
and activitiy.id_service = service.id
);
If, on the other hand, I want to link the condition by AND, then it is not quite so simple. I could add multiple EXITS conditions:
select *
from service
where exists (
select 1
from activitiy
join activitiy_activitiy_group
on activitiy.id = activitiy_activitiy_group.id_activitiy
join activitiy_group
on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
where (
activitiy_group.id_type = 1
and activitiy.id_service = service.id
)
and
exists (
select 1
from activitiy
join activitiy_activitiy_group
on activitiy.id = activitiy_activitiy_group.id_activitiy
join activitiy_group
on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
where (
activitiy_group.id_type = 3
and activitiy.id_service = service.id
);
But I wonder if this approach is performant for many filter elements. I experimented a bit and one approach would be with only one subselect, by selecting all distinct activity group type ids related to a service into one array and comparing it with the filter values:
select *
from service
where true =
(select ARRAY_AGG(activitiy_group.id_type) @> ('{1,3}'::Integer[])
from activitiy
join activitiy_activitiy_group
on activitiy.id = activitiy_activitiy_group.id_activitiy
join activitiy_group
on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
where ativitiy.id_service = service.id);
But here, too, the question arises as to whether this is really performant. Can anyone assess this, or is there perhaps a more sensible alternative approach? I assumed that the underlying basic problem is a standard problem, but unfortunately could not find any other approach on the internet.
I'll pick a setting where the involved tables are big and writes are comparatively rare. Then it makes sense to create an auxiliary MATERIALIZED VIEW
(once) to dramatically speed up the query:
CREATE MATERIALIZED VIEW service_activity_types AS
SELECT a.id_service, array_agg(ag.id_type) AS activity_types
FROM (
SELECT DISTINCT a.id_service, ag.id_type
FROM activitiy a
JOIN activitiy_activitiy_group aag ON aag.id_activitiy = a.id
JOIN activitiy_group ag ON ag.id = aag.id_activitiy_group
ORDER BY 1, 2
) sub
GROUP BY 1;
Generates a table with unique services with an array of unique activity types.
(It's typically faster to apply DISTINCT
and ORDER BY
once in a subquery.)
Create a unique index on service_activity_types
to allow refreshing the MV CONCURRENTLY
:
CREATE UNIQUE INDEX service_activity_type_uni ON service_activity_types (id_service);
Refresh after impactful changes to underlying tables:
REFRESH MATERIALIZED VIEW CONCURRENTLY service_activity_types;
Create a index on the array column to make the query fast. There are various options. For your case, I expect a GIN index using the operator class gin__int_ops
from the additional module intarray
to be fastest. Install the module once per database first. See:
CREATE INDEX service_activity_type_gin_idx ON service_activity_types USING gin (activity_types gin__int_ops);
Maybe even a multicolumn index. See:
Also, to get your column statistics started:
ANALYZE service_activity_types;
Then your query can be:
SELECT id_service
FROM service_activity_types
WHERE activity_types @> '{1,3}';
And it will be blazingly fast.