Tags: sql, postgresql, join, sql-order-by

Postgres inner join where and order by


Is there a way to efficiently order by a text field in a join query (i.e., one that stays fast on large datasets)? I understand that I probably need to filter the data somehow to reduce how much has to be sorted, but I haven't figured out how to do that yet. I need the query results in ascending order.

The "values" table has 27,515,151 records (id and text value), and the "entity_values" table (id, entity_id, value_id, attribute_id) has 31,875,250 records. When I join these tables, it takes 40 seconds to retrieve the first results ordered by value. However, without sorting, the query runs quickly. However, when I order by descending, the query executes quickly.

I have indexes on the values table (id, value) and on the entity_values table (attribute_id, value_id). Please help me.

SELECT v.value
FROM entity_values ev
JOIN values v ON v.id = ev.value_id
WHERE ev.attribute_id = 5
ORDER BY v.value
LIMIT 10

Here is the execution plan:

Limit  (cost=1001.02..2149.62 rows=10 width=4) (actual time=67078.933..67079.005 rows=10 loops=1)
  Output: v.value
  Buffers: shared hit=110035024 read=21958968
  ->  Gather Merge  (cost=1001.02..916271399.57 rows=7977313 width=4) (actual time=67078.932..67079.000 rows=10 loops=1)
        Output: v.value
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=110035024 read=21958968
        ->  Nested Loop  (cost=1.00..915349619.69 rows=3323880 width=4) (actual time=39414.285..39414.287 rows=7 loops=3)
              Output: v.value
              Buffers: shared hit=110035024 read=21958968
              Worker 0: actual time=36252.205..36252.208 rows=10 loops=1
                Buffers: shared hit=33818981 read=6259066
              Worker 1: actual time=14912.457..14912.459 rows=10 loops=1
                Buffers: shared hit=9003723 read=1604262
              ->  Parallel Index Scan using values_value_index on public.values v  (cost=0.44..8957752.78 rows=11464647 width=12) (actual time=0.836..24843.321 rows=9168174 loops=3)
                    Output: v.value, v.id
                    Buffers: shared hit=16935 read=21958955
                    Worker 0: actual time=0.428..22411.536 rows=8453050 loops=1
                      Buffers: shared hit=6779 read=6259063
                    Worker 1: actual time=0.665..10883.462 rows=2250302 loops=1
                      Buffers: shared hit=2513 read=1604259
              ->  Index Only Scan using index_attributeid_and_valueid on public.entity_values ev  (cost=0.56..76.38 rows=268 width=8) (actual time=0.001..0.001 rows=0 loops=27504522)
                    Output: ev.attribute_id, ev.value_id
                    Index Cond: ((ev.attribute_id = 5) AND (ev.value_id = v.id))
                    Heap Fetches: 21
                    Buffers: shared hit=110018089 read=13
                    Worker 0: actual time=0.001..0.001 rows=0 loops=8453050
                      Buffers: shared hit=33812202 read=3
                    Worker 1: actual time=0.002..0.002 rows=0 loops=2250302
                      Buffers: shared hit=9001210 read=3
Planning Time: 2.337 ms
Execution Time: 67079.040 ms

Solution

  • The question is:

    Is there a way to efficiently order by a text field in a join query...?

    You are joining two tables, filtering (with an equality condition) on one of them and sorting by a column of the other.

    A straightforward solution is what PostgreSQL is doing now: reading one table (the driving table) and using a nested loop to read and filter the secondary one. This can work well for small tables, but it won't scale for large or very large ones.

    I can think of two generic solutions to keep high performance on a query like this one:

    1. Multi-table index. Unfortunately, this functionality is not available in PostgreSQL; I've only seen it in Oracle so far. Such an index would be able to filter and sort in a single index scan, but since you are using PostgreSQL this won't work.

    2. Redundancy. If you add redundancy to the secondary table and index it, you'll be able to query it without even joining the primary one. That is:

    create table values (
      id int primary key not null,
      value text, -- the question describes value as a text field
      attribute_id int -- redundant value copied from entity_values.attribute_id
    );

    create index ix1 on values (attribute_id, value);
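
    The new column also needs a one-time backfill for the rows that already exist, before the triggers described below take over. A minimal sketch, assuming each value_id maps to a single attribute_id:

    -- one-time backfill of the redundant column from entity_values
    -- note: values is a reserved word, so the table name may need to be double-quoted
    update values v
    set attribute_id = ev.attribute_id
    from entity_values ev
    where ev.value_id = v.id;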
    

    You'll need to add a couple of triggers to keep attribute_id up to date whenever a row is inserted or updated in the table entity_values. I leave the full details as an exercise, but a sketch of the idea follows.
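
    A minimal sketch of what such triggers could look like, assuming a reasonably recent PostgreSQL (EXECUTE FUNCTION requires version 11 or later); the function and trigger names are just placeholders:

    -- keep values.attribute_id in sync when entity_values changes
    create or replace function sync_values_attribute_id() returns trigger as $$
    begin
      update values
      set attribute_id = new.attribute_id
      where id = new.value_id;
      return new;
    end;
    $$ language plpgsql;

    create trigger entity_values_sync_attribute_id
    after insert or update of attribute_id, value_id on entity_values
    for each row
    execute function sync_values_attribute_id();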

    With this redundancy in place, the query becomes much simpler, since it can work on a single table. It should look like this:

    SELECT value
    FROM values
    WHERE attribute_id = 5
    ORDER BY value
    LIMIT 10
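
    With the index on (attribute_id, value), this query should be answerable by a single index scan that returns rows already ordered by value, so the LIMIT can stop after the first ten matches. You can check which plan the planner actually chooses with the same options used in the question, for example:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT value
    FROM values
    WHERE attribute_id = 5
    ORDER BY value
    LIMIT 10;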
    

    Finally, adding redundancy to the database comes with its own problems, so the decision needs to be made carefully. There's always the chance that the redundant values lose their synchronization. Triggers come with their own problems as well; it is typically quite difficult to version them or to keep them in source control (SCM).