sqlpostgresqlexistspostgresql-performancerelational-division

Combine similar EXISTS conditions in PostgreSQL


I would like to know whether multiple similar exists conditions can be combined in a meaningful and performant way.

Let us assume the following example: Different activities can be assigned to a service (n-m). Activities can be grouped independently into activity groups. Activity groups can be assigned to a group type.

If I now want to find all services that have a reference to certain group types and I want to link the condition by OR, then this is relatively simple by combining EXISTS and IN.

select *
from service
where exists (
          select 1
          from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
          where (
                        activitiy_group.id_type in (1, 3)
                        and activitiy.id_service = service.id
                    );

If, on the other hand, I want to link the condition by AND, then it is not quite so simple. I could add multiple EXITS conditions:

select *
from service
where exists (
          select 1
          from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
          where (
                        activitiy_group.id_type = 1
                        and activitiy.id_service = service.id
                    )
and
exists (
          select 1
          from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
          where (
                            activitiy_group.id_type = 3
                        and activitiy.id_service = service.id
                    );

But I wonder if this approach is performant for many filter elements. I experimented a bit and one approach would be with only one subselect, by selecting all distinct activity group type ids related to a service into one array and comparing it with the filter values:

select *
from service
where true =
      (select ARRAY_AGG(activitiy_group.id_type) @> ('{1,3}'::Integer[])
       from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
       where ativitiy.id_service = service.id);

But here, too, the question arises as to whether this is really performant. Can anyone assess this, or is there perhaps a more sensible alternative approach? I assumed that the underlying basic problem is a standard problem, but unfortunately could not find any other approach on the internet.


Solution

  • I'll pick a setting where the involved tables are big and writes are comparatively rare. Then it makes sense to create an auxiliary MATERIALIZED VIEW (once) to dramatically speed up the query:

    CREATE MATERIALIZED VIEW service_activity_types AS
    SELECT a.id_service, array_agg(ag.id_type) AS activity_types
    FROM  (
       SELECT DISTINCT a.id_service, ag.id_type
       FROM   activitiy                 a
       JOIN   activitiy_activitiy_group aag ON aag.id_activitiy = a.id
       JOIN   activitiy_group           ag  ON ag.id = aag.id_activitiy_group
       ORDER  BY 1, 2
       ) sub
    GROUP  BY 1;
    

    Generates a table with unique services with an array of unique activity types.

    (It's typically faster to apply DISTINCT and ORDER BY once in a subquery.)

    Create a unique index on service_activity_types to allow refreshing the MV CONCURRENTLY:

    CREATE UNIQUE INDEX service_activity_type_uni ON service_activity_types (id_service);
    

    Refresh after impactful changes to underlying tables:

    REFRESH MATERIALIZED VIEW CONCURRENTLY service_activity_types;
    

    Create a index on the array column to make the query fast. There are various options. For your case, I expect a GIN index using the operator class gin__int_ops from the additional module intarray to be fastest. Install the module once per database first. See:

    CREATE INDEX service_activity_type_gin_idx ON service_activity_types USING gin (activity_types gin__int_ops);
    

    Maybe even a multicolumn index. See:

    Also, to get your column statistics started:

    ANALYZE service_activity_types;
    

    Then your query can be:

    SELECT id_service
    FROM   service_activity_types
    WHERE  activity_types @> '{1,3}';
    

    And it will be blazingly fast.