Tags: postgresql, aggregate-functions

Postgres aggregate is slower than just fetching and aggregating in Python with min()


I have two queries:

explain analyze
select * from reviews
    where
    product_id = 4922
    and time > now() - interval '2 month'

and

explain analyze
SELECT
    max(reviews)
FROM
    reviews
WHERE
    product_id = 4922
    AND time > now() - interval '2 month';

The first one takes only about 7 ms to execute, but the second one takes about 3.2 seconds. The reviews table contains about 8.7 million rows.

The table already has a composite index on (product_id, time), and the faster query uses it. However, the aggregate query uses the (reviews) index instead and has to filter out millions of rows. I am not sure why the aggregate query would use that index; the reviews column isn't even nullable.

Here are the EXPLAIN ANALYZE results; note that the plan for the second (aggregate) query comes first.

2) The aggregate query:

Result  (cost=911.01..911.02 rows=1 width=8) (actual time=3483.019..3483.020 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.43..911.01 rows=1 width=8) (actual time=3483.013..3483.014 rows=1 loops=1)
          ->  Index Scan Backward using reviews_reviews_idx on reviews  (cost=0.43..1341280.17 rows=1473 width=8) (actual time=3483.011..3483.012 rows=1 loops=1)
                Index Cond: (reviews IS NOT NULL)
                Filter: ((product_id = 4922) AND ("time" > (now() - '2 mons'::interval)))
                Rows Removed by Filter: 3277989
Planning Time: 0.316 ms
Execution Time: 3483.043 ms

1) The plain select:

Bitmap Heap Scan on reviews  (cost=39.54..5619.14 rows=1473 width=210) (actual time=1.571..6.689 rows=3610 loops=1)
  Recheck Cond: ((product_id = 4922) AND ("time" > (now() - '2 mons'::interval)))
  Heap Blocks: exact=3582
  ->  Bitmap Index Scan on reviews_product_id_time_idx  (cost=0.00..39.17 rows=1473 width=0) (actual time=0.705..0.706 rows=3610 loops=1)
        Index Cond: ((product_id = 4922) AND ("time" > (now() - '2 mons'::interval)))
Planning Time: 0.239 ms
Execution Time: 7.036 ms

I could probably add a composite index on (product_id, time, reviews) and it would probably be fast, but I believe that Postgres already has everything it needs.
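
For reference, the index I have in mind would be something along these lines (the index name is just a placeholder):

-- hypothetical covering index on the columns the aggregate query touches
CREATE INDEX reviews_product_id_time_reviews_idx
    ON public.reviews USING btree (product_id, "time", reviews);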

Update: DDL for reviews as requested

postgres@ubuntu-8gb-sin-1:~$ pg_dump -t reviews --schema-only postgres
--
-- PostgreSQL database dump
--

-- Dumped from database version 16.6 (Ubuntu 16.6-0ubuntu0.24.04.1)
-- Dumped by pg_dump version 16.6 (Ubuntu 16.6-0ubuntu0.24.04.1)

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

SET default_tablespace = '';

SET default_table_access_method = heap;

--
-- Name: reviews; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.reviews (
    id bigint NOT NULL,
    daraz_id text NOT NULL,
    rating_score double precision NOT NULL,
    reviews bigint NOT NULL,
    client text NOT NULL,
    href text NOT NULL,
    "time" timestamp with time zone NOT NULL,
    hostcountry text,
    product_id integer
);


ALTER TABLE public.reviews OWNER TO postgres;

--
-- Name: reviews_id_seq; Type: SEQUENCE; Schema: public; Owner: postgres
--

CREATE SEQUENCE public.reviews_id_seq
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1;


ALTER SEQUENCE public.reviews_id_seq OWNER TO postgres;

--
-- Name: reviews_id_seq; Type: SEQUENCE OWNED BY; Schema: public; Owner: postgres
--

ALTER SEQUENCE public.reviews_id_seq OWNED BY public.reviews.id;


--
-- Name: reviews id; Type: DEFAULT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.reviews ALTER COLUMN id SET DEFAULT nextval('public.reviews_id_seq'::regclass);


--
-- Name: reviews reviews_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.reviews
    ADD CONSTRAINT reviews_pkey PRIMARY KEY (id);


--
-- Name: reviews_client_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_client_idx ON public.reviews USING btree (client);


--
-- Name: reviews_daraz_id_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_daraz_id_idx ON public.reviews USING btree (daraz_id);


--
-- Name: reviews_daraz_id_reviews_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_daraz_id_reviews_idx ON public.reviews USING btree (daraz_id, reviews);


--
-- Name: reviews_hostcountry_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_hostcountry_idx ON public.reviews USING btree (hostcountry);


--
-- Name: reviews_product_id_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_product_id_idx ON public.reviews USING btree (product_id);


--
-- Name: reviews_product_id_time_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_product_id_time_idx ON public.reviews USING btree (product_id, "time");


--
-- Name: reviews_rating_score_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_rating_score_idx ON public.reviews USING btree (rating_score);


--
-- Name: reviews_time_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_time_idx ON public.reviews USING btree ("time");


--
-- Name: reviews_reviews_idx; Type: INDEX; Schema: public; Owner: postgres
--

CREATE INDEX reviews_reviews_idx ON public.reviews USING btree (reviews);


--
-- Name: reviews reviews_products_fk; Type: FK CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.reviews
    ADD CONSTRAINT reviews_products_fk FOREIGN KEY (product_id) REFERENCES public.products(id);


--
-- PostgreSQL database dump complete
--

Solution

  • This is a classic problem with ordered index plans. Unfortunately I don't know of a short pithy phrase to describe it, which makes searching for it rather hard.

    The planner thinks that by scanning the index in order by "reviews" DESC it will get to stop after finding the first row that meets the conditions on product_id and time, and that this will happen after scanning only about 1/1473 of the index. But it actually has to scan much more of the index than that (3,277,989 divided by the number of tuples in the table, the latter of which is unknown to me), because the high end of the index is selectively deficient in rows meeting those criteria.

    PostgreSQL has no mechanism for accurately predicting this type of dependency, so no amount of analyzing or improving statistics is going to help.

    As for "I could probably add a composite index on (product_id,time,reviews) and it would probably be fast but I believe that postgres already has everything it needs":

    You could argue that PostgreSQL should have a better way of doing this estimate, but it just doesn't. Not in any version, so you have to pick your battles. The better index would be on (product_id, reviews, "time"). That way it can still read rows in order of reviews, but does so only within the proper value of product_id, and therefore doesn't need to filter out rows with the wrong product_id. It would still need to filter out rows with the wrong "time", but I am assuming most of the selectivity is due to product_id, not "time".

    Adding the above index is probably the easiest way of dealing with this (or dropping the index on reviews), but if you don't want to do that, you could instead rewrite the query in a way that defeats the use of that index. One way to do this is to add a dummy operation, taking max(reviews+0) rather than max(reviews). Both options are sketched below.
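
    A minimal sketch of both options (the index name is just a placeholder):

    -- Option 1: an index that serves max(reviews) within a single product_id
    CREATE INDEX reviews_product_id_reviews_time_idx
        ON public.reviews USING btree (product_id, reviews, "time");

    -- Option 2: rewrite the query so reviews_reviews_idx can no longer be used
    SELECT max(reviews + 0)
    FROM reviews
    WHERE product_id = 4922
      AND time > now() - interval '2 month';

    With the first option, the backward index scan stays within product_id = 4922 and so no longer has to discard millions of rows; with the second, the expression reviews + 0 matches no index, so the planner should fall back to reviews_product_id_time_idx.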