sql postgresql performance datatable query-optimization

Performance of SQL query in a big table


I have a postgres table looking like the following:

id: int [PK]
scenario_id: int [FK]
node_id: int Optional[FK]
element_id: int Optional[FK]
result: char(20)
value: double
unit: char(12)

In it I store rows for various results, for both the nodes and the elements of my graph. If a result variable, say x, is only valid for a node, the row looks like this:

id    scenario_id    node_id    element_id    result    value    unit
1     1              100        [null]        x         0.1      'MW'

After adding more scenarios, I realized that this table grows really fast (40 million rows across 5000 scenarios). My query for fetching all the results of one scenario takes 7 seconds on average according to pgAdmin.

How could I make this table more efficient, so that the query does not take 7 seconds?


QUERY:

SELECT * FROM simulation.scenario_results t
    WHERE t.scenario_id = 5000

Explain:

[pgAdmin screenshot of the query plan, not included]

DDL SCRIPT:

-- Table: simulation.scenario_results

-- DROP TABLE IF EXISTS simulation.scenario_results;

CREATE TABLE IF NOT EXISTS simulation.scenario_results
(
    id integer NOT NULL DEFAULT nextval('simulation.scenario_results_id_seq'::regclass),
    scenario_id integer NOT NULL,
    node_id integer,
    element_id integer,
    result character varying(20) COLLATE pg_catalog."default" NOT NULL,
    value double precision NOT NULL,
    unit character varying(12) COLLATE pg_catalog."default",
    CONSTRAINT scenario_results_pkey PRIMARY KEY (id),
    CONSTRAINT scenario_results_element_id_fkey FOREIGN KEY (element_id)
        REFERENCES simulation.elements (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE CASCADE,
    CONSTRAINT scenario_results_node_id_fkey FOREIGN KEY (node_id)
        REFERENCES simulation.nodes (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE CASCADE,
    CONSTRAINT scenario_results_scenario_id_fkey FOREIGN KEY (scenario_id)
        REFERENCES simulation.scenarios (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE CASCADE
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS simulation.scenario_results
    OWNER to postgres;

Solution

  • It looks like the only index you have is the unique one that comes with the primary key. Even a simple, default btree index on the column you're filtering on would speed things up:

    create index on simulation.scenario_results(scenario_id);
    

    Don't forget to run vacuum analyze simulation.scenario_results; afterwards, then test your query again to see your seq scan replaced by a much faster index scan. A seq scan means each query has to read the whole table to find your specific scenario; an index scan lets it look up where the rows for that scenario_id are and skip the rest of the table.

    In a test on 300k random rows, the parallel seq scan needed 90ms, while the index scan went below 1ms.
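
    To check which plan your own table gets, you can run the query from the question under explain, once before and once after creating the index. This is only a sketch of the check; the timings and the exact plan depend on your data and settings:

    ```sql
    -- Executes the query and reports the actual plan, timings and buffer usage.
    -- Before the index, expect a "Seq Scan" or "Parallel Seq Scan" node.
    -- After it, expect an "Index Scan" or a "Bitmap Heap Scan" on the new index
    -- (the planner picks between the two based on its row estimates).
    explain (analyze, buffers)
    select *
    from simulation.scenario_results t
    where t.scenario_id = 5000;
    ```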

    You can speed this up further in two ways. One is to make the index a covering index, meaning all the data you might want from the table is stored right in it. A query can then read everything straight from the index without having to visit the table (an index-only scan), which in the test went down to 0.3ms:

    create index on simulation.scenario_results(scenario_id)
       include(id,node_id,element_id,result,value,unit);
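
    The trade-off is size and visibility: a covering index stores a second copy of every included column, and an index-only scan can skip the table only for pages that vacuum has marked all-visible. A sketch for checking both, using the table from the question:

    ```sql
    -- Refresh the visibility map and the planner statistics first.
    vacuum analyze simulation.scenario_results;

    -- Look for "Index Only Scan" in the plan; a high "Heap Fetches" count
    -- means the table still had to be visited for many rows.
    explain (analyze, buffers)
    select *
    from simulation.scenario_results t
    where t.scenario_id = 5000;

    -- Compare the table's size against the combined size of its indexes.
    select pg_size_pretty(pg_table_size('simulation.scenario_results'))   as table_size,
           pg_size_pretty(pg_indexes_size('simulation.scenario_results')) as indexes_size;
    ```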
    

    Or you can cluster the table, physically reordering it to follow a regular, lightweight index so that the rows of one scenario sit next to each other. In the test this took only 0.4ms, without having to duplicate the data between the table and the index:

    create index idx1 on simulation.scenario_results(scenario_id);
    cluster verbose simulation.scenario_results using idx1;
    analyze simulation.scenario_results ;
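
    One caveat: cluster is a one-time reordering. It takes an exclusive lock on the table while it runs, and rows written afterwards are not kept in index order. A rough way to see whether the table has drifted, assuming the statistics are current, is the correlation PostgreSQL records for the column; values near 1 or -1 mean the physical order still follows scenario_id:

    ```sql
    -- Physical-vs-logical ordering of scenario_id, as of the last analyze.
    select attname, correlation
    from pg_stats
    where schemaname = 'simulation'
      and tablename  = 'scenario_results'
      and attname    = 'scenario_id';

    -- If it has drifted, re-cluster. With no USING clause, cluster reuses
    -- the index the table was last clustered on (idx1 in the example above).
    cluster verbose simulation.scenario_results;
    analyze simulation.scenario_results;
    ```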