The Situation (updated):
I have 3 PostgreSQL tables (PostgreSQL version 14.9) that store sensor data from (currently) 2670 data points. The data was written to the database in blocks of 3 days' worth of data, using automatically generated INSERT statements of the form:
INSERT INTO monitoring_data.time_series_numerical (timestamp, property_id, value)
VALUES ('timestamp', 'property_id', 'value')
...
ON CONFLICT (timestamp, property_id) DO NOTHING;
This ultimately led to the database becoming slower and slower, to the point that new storage space had to be freed up before it could be accessed again. The disk space usage of the 3 tables:
oid | table_schema | table_name | number_of_rows | total_row_bytes | row_bytes | total_bytes | index_bytes | toast_bytes | table_bytes | total | index | table | toast |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22220 | monitoring_data | time_series | 5.5084934e+08 | 226.11499436726209 | 97.18495597524046 | 124555296768 | 71006248960 | NULL | 53549047808 | 116 GB | 66 GB | 50 GB | NULL |
22230 | monitoring_data | time_series_numerical | 5.0103974e+08 | 242.67533732942616 | 108.90748283270925 | 121589989376 | 67007979520 | NULL | 54582009856 | 113 GB | 62 GB | 51 GB | NULL |
22238 | monitoring_data | time_series_alphabetical | 3.25614e+07 | 255.89714840902735 | 110.47100539466943 | 8332369920 | 4734271488 | 8192 | 3598090240 | 7946 MB | 4515 MB | 3431 MB | 8192 bytes |
The 3 tables, via the CREATE script from pgAdmin 4:
CREATE TABLE IF NOT EXISTS monitoring_data.time_series
(
"timestamp" timestamp without time zone NOT NULL,
property_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
CONSTRAINT time_series_pkey PRIMARY KEY ("timestamp", property_id),
CONSTRAINT time_series_property_id_fkey FOREIGN KEY (property_id)
REFERENCES monitoring_data.properties (property_id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE RESTRICT
)
(property_id is a varchar(255), e.g. c0022:93EA6CCA3B7D1829_AnalogInput1.status.voltage, which is used as identification for the data point.)
time_series_numerical:
CREATE TABLE IF NOT EXISTS monitoring_data.time_series_numerical
(
-- Inherited from table monitoring_data.time_series: "timestamp" timestamp without time zone NOT NULL,
-- Inherited from table monitoring_data.time_series: property_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
value double precision NOT NULL,
CONSTRAINT time_series_numerical_pkey PRIMARY KEY ("timestamp", property_id),
CONSTRAINT time_series_numerical_timestamp_property_id_fkey FOREIGN KEY ("timestamp", property_id)
REFERENCES monitoring_data.time_series ("timestamp", property_id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
INHERITS (monitoring_data.time_series)
time_series_alphabetical: Same as time_series_numerical except for the value column:
value text COLLATE pg_catalog."default" NOT NULL,
To ensure that the foreign key constraint is fulfilled when writing records to one of the inheriting tables, a trigger was created in conjunction with a function that executes the following before the INSERT of each record:
BEGIN
INSERT INTO monitoring_data.time_series (timestamp, property_id)
SELECT NEW.timestamp, NEW.property_id
WHERE NOT EXISTS (
SELECT 1 FROM monitoring_data.time_series
WHERE timestamp = NEW.timestamp
AND property_id = NEW.property_id
);
RETURN NEW;
END;
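For reference, a minimal sketch of how such a BEFORE INSERT trigger could be wired up (the function and trigger names here are illustrative, since the post only shows the function body):

```sql
-- Hypothetical names; only the body above is taken from the post.
CREATE OR REPLACE FUNCTION monitoring_data.ensure_time_series_row()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    -- Insert the parent row first so the FK check on the child succeeds.
    INSERT INTO monitoring_data.time_series ("timestamp", property_id)
    SELECT NEW."timestamp", NEW.property_id
    WHERE NOT EXISTS (
        SELECT 1 FROM monitoring_data.time_series ts
        WHERE ts."timestamp" = NEW."timestamp"
          AND ts.property_id = NEW.property_id
    );
    RETURN NEW;
END;
$$;

CREATE TRIGGER time_series_numerical_fk_trigger
    BEFORE INSERT ON monitoring_data.time_series_numerical
    FOR EACH ROW
    EXECUTE FUNCTION monitoring_data.ensure_time_series_row();
```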
The time_series table should simplify access to measured values in the underlying tables, regardless of the data type, and allow all timestamp and property_id entries to be stored together and only once.
As I have already noted, this is not working as expected: the data is stored twice (once in the time_series table and once in the respective table for the measurement), and in the time_series table itself each entry even exists twice.
Query example:
SELECT *
FROM monitoring_data.time_series_numerical
WHERE property_id = 'c0022:9FF441820C58912B_HeatPump.configuration.flowTemperature'
AND timestamp >= '2024-09-01 00:00:00'
AND timestamp < '2024-09-02 00:00:00'
ORDER BY timestamp;
Result with EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS):
"Index Scan using time_series_numerical_pkey on monitoring_data.time_series_numerical (cost=0.57..1153041.19 rows=4516 width=72) (actual time=0.516..633099.399 rows=5725 loops=1)"
" Output: ""timestamp"", property_id, value"
" Index Cond: ((time_series_numerical.""timestamp"" >= '2024-09-01 00:00:00'::timestamp without time zone) AND (time_series_numerical.""timestamp"" < '2024-09-02 00:00:00'::timestamp without time zone) AND ((time_series_numerical.property_id)::text = 'c0022:9FF441820C58912B_HeatPump.configuration.flowTemperature'::text))"
" Buffers: shared read=199057"
"Settings: search_path = 'monitoring_data'"
"Planning Time: 0.485 ms"
"JIT:"
" Functions: 2"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 2.138 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 2.138 ms"
"Execution Time: 633108.973 ms"
Before I begin to implement a solution for this problem, I am seeking advice here, as I am just starting to work with PostgreSQL.
Thoughts on solutions and questions:
I currently have two ideas to tackle the problem, but I would like to know what other options I have and what would be considered best practice for my use case.
1. Delete all entries from the time_series table and remove the foreign key constraints from both inheriting tables, as well as the triggers and functions, so that access via the time_series table to the inheriting tables is still possible but significantly less storage space is used.
2. Use partitioning to split the data across multiple tables based on the timestamp in order to speed up queries.
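A minimal sketch of what the partitioning approach could look like with declarative range partitioning (table and partition names are illustrative, not an existing design):

```sql
-- Hypothetical partitioned replacement for time_series_numerical.
-- The partition key ("timestamp") must be part of the primary key.
CREATE TABLE monitoring_data.time_series_numerical_part (
    "timestamp" timestamp without time zone NOT NULL,
    property_id character varying(255) NOT NULL,
    value double precision NOT NULL,
    PRIMARY KEY ("timestamp", property_id)
) PARTITION BY RANGE ("timestamp");

-- One partition per month, for example:
CREATE TABLE monitoring_data.time_series_numerical_2024_09
    PARTITION OF monitoring_data.time_series_numerical_part
    FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');
```

Queries with a timestamp range predicate would then only touch the matching partitions, and old partitions can be dropped cheaply instead of being deleted row by row.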
It is still unclear to me why the indexes take up so much space and how I can reduce that.
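To see how that space is distributed across the individual indexes, a query against the standard catalogs (a sketch; adjust the schema/table filter as needed) can help:

```sql
-- Size of every index on the three tables, largest first.
SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
JOIN pg_class t ON t.oid = i.indrelid
WHERE t.relname IN ('time_series', 'time_series_numerical', 'time_series_alphabetical')
ORDER BY pg_relation_size(c.oid) DESC;
```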
Additional info:
PostgreSQL version 14.9; there are no additional indexes besides the primary keys; all indexes are B-tree indexes; there are no dead tuples; autovacuum is enabled; extensions: postgis, fuzzystrmatch, uuid-ossp; it was decided that time_series_numerical and time_series_alphabetical should be kept as separate tables.
The OP's original design includes the full property name (called property_id in the post) with each sample. The example name is 62 characters long, which seems a reasonable estimate of the mean name length given the table statistics that were provided. Repeating the name with each sample is an extremely inefficient use of space.
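A rough back-of-envelope calculation illustrates the scale of the waste, assuming a 62-byte mean name versus an 8-byte BIGINT key and ignoring per-row headers, the varchar length header, and alignment:

```python
rows = 501_039_740    # approximate row count of time_series_numerical from the post
name_bytes = 62       # mean property name length estimated above
bigint_bytes = 8      # storage for a BIGINT surrogate key

saved_per_row = name_bytes - bigint_bytes
total_saved_gb = rows * saved_per_row / 1024**3

print(f"~{total_saved_gb:.1f} GB saved in the heap alone")  # → ~25.2 GB saved in the heap alone
```

The name is also repeated in the primary-key index, so the actual savings are roughly double this figure, which is consistent with the ~62 GB of index space reported for time_series_numerical.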
The following demonstrates an approach that reduces the required space by replacing the property name in each sample with a reference to a properties table. This approach results in an order-of-magnitude improvement in storage efficiency. Some column names have been changed to avoid using PostgreSQL keywords.
Below are the table definitions (time_series_alphabetical is omitted since it is nearly identical to time_series_numerical):
CREATE TABLE properties (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
property_name TEXT NOT NULL UNIQUE,
CONSTRAINT properties_updated_at_ck CHECK (created_at <= updated_at)
);
CREATE TABLE time_series_numerical (
sample_timestamp TIMESTAMPTZ,
property_id BIGINT REFERENCES properties (id),
created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
sample_value DOUBLE PRECISION,
CONSTRAINT time_series_numerical_pk PRIMARY KEY (property_id, sample_timestamp),
CONSTRAINT time_series_numerical_updated_at_ck CHECK (created_at <= updated_at)
);
The audit columns (created_at and updated_at) can be omitted if there is no need or desire to track those attributes. Timestamps are declared as TIMESTAMPTZ, as recommended at Date/Time storage. The primary key for time_series_numerical is (property_id, sample_timestamp) to facilitate efficient retrieval by property. If retrieval by sample_timestamp is also common, then an additional index should be added with sample_timestamp as the first attribute.
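If that access pattern does apply, the extra index would look something like this (the index name is arbitrary):

```sql
-- Supports range scans by time across all properties.
CREATE INDEX time_series_numerical_ts_idx
    ON time_series_numerical (sample_timestamp, property_id);
```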
The following uses a single SQL statement to insert a set of samples with the ability to automatically add properties that have not previously been recorded:
WITH
samples AS (
SELECT
ts::TIMESTAMPTZ AS sample_timestamp,
property_name,
sample_value
FROM
(
VALUES
('2024-01-01 00:00:00', 'prop1', 1),
('2024-01-01 01:00:00', 'prop1', 2),
('2024-01-01 00:00:00', 'prop2', 3),
('2024-01-01 00:00:00', 'prop4', 4),
('2024-01-01 00:00:00', 'prop3', 5)
) v (ts, property_name, sample_value)
),
new_properties AS (
INSERT INTO properties (property_name)
SELECT DISTINCT property_name
FROM samples
ON CONFLICT DO NOTHING
RETURNING id, property_name
)
INSERT INTO time_series_numerical AS tsn (sample_timestamp, property_id, sample_value)
SELECT s.sample_timestamp,
COALESCE(p.id,
(SELECT np.id
FROM new_properties np
WHERE np.property_name = s.property_name)) AS property_id,
s.sample_value
FROM samples s
LEFT JOIN properties p ON p.property_name = s.property_name
ON CONFLICT (sample_timestamp, property_id)
DO UPDATE SET sample_value = EXCLUDED.sample_value,
updated_at = CURRENT_TIMESTAMP
WHERE tsn.sample_value IS DISTINCT FROM EXCLUDED.sample_value;
The ON CONFLICT clause updates a previously recorded sample if the value has changed.
Running the following query demonstrates that each property has only been inserted once:
SELECT * FROM properties ORDER BY property_name;
id | created_at | updated_at | property_name |
---|---|---|---|
3 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 | prop1 |
1 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 | prop2 |
2 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 | prop3 |
4 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 | prop4 |
Samples along with their associated property name can be retrieved with the following query:
SELECT
tsn.sample_timestamp,
tsn.property_id,
p.property_name,
tsn.sample_value,
tsn.created_at,
tsn.updated_at
FROM
time_series_numerical tsn
JOIN properties p ON p.id = tsn.property_id;
sample_timestamp | property_id | property_name | sample_value | created_at | updated_at |
---|---|---|---|---|---|
2024-01-01 00:00:00-05 | 3 | prop1 | 1 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 |
2024-01-01 01:00:00-05 | 3 | prop1 | 2 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 |
2024-01-01 00:00:00-05 | 1 | prop2 | 3 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 |
2024-01-01 00:00:00-05 | 4 | prop4 | 4 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 |
2024-01-01 00:00:00-05 | 2 | prop3 | 5 | 2024-09-22 15:16:27.151381-04 | 2024-09-22 15:16:27.151381-04 |
To demonstrate the improved storage efficiency, run the following block to generate synthetic data useful for exploring the effects of different indexing options and system settings (on my 2017 MacBook Pro, this took ~1 hour to complete):
DO $BLOCK$
<<local>>
DECLARE
num_properties CONSTANT INTEGER := 2670;
property_name_length CONSTANT INTEGER := 50;
sampling_interval CONSTANT INTERVAL := INTERVAL '15' SECOND;
missed_sample_probability CONSTANT DOUBLE PRECISION := 0.006;
sample_date_range CONSTANT TSTZRANGE := TSTZRANGE('2024-09-01'::TIMESTAMP AT TIME ZONE 'UTC',
'2024-09-15'::TIMESTAMP AT TIME ZONE 'UTC');
BEGIN
TRUNCATE TABLE time_series_numerical;
TRUNCATE TABLE properties CASCADE;
PERFORM SETVAL('properties_id_seq', 1, FALSE);
INSERT INTO properties(property_name)
SELECT SUBSTRING(REPEAT(ENCODE(UUID_SEND(GEN_RANDOM_UUID()), 'HEX'),
CEIL(local.property_name_length / 16.0)::INTEGER),
local.property_name_length)
FROM GENERATE_SERIES(1, local.num_properties) gs(n);
INSERT INTO time_series_numerical(sample_timestamp, property_id, sample_value)
SELECT sample_times.sample_timestamp, props.id, RANDOM() * 10000
FROM GENERATE_SERIES(1, local.num_properties) props(id)
CROSS JOIN GENERATE_SERIES(LOWER(local.sample_date_range),
UPPER(local.sample_date_range) - INTERVAL '0.000001' SECOND,
local.sampling_interval) sample_times(sample_timestamp)
WHERE random() > local.missed_sample_probability;
END;
$BLOCK$ language plpgsql;
This resulted in 2670 rows in properties using ~352 KB for table data and ~440 KB for indexes, and 214,017,381 rows in time_series_numerical using ~13.61 GB for table data and ~8.9 GB for indexes (removing the audit columns created_at and updated_at reduces the table size by ~3.2 GB).
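Sizes like these can be checked with the standard database-object size functions, e.g.:

```sql
-- Heap size, total index size, and row count for the sample table.
SELECT pg_size_pretty(pg_table_size('time_series_numerical'))   AS table_size,
       pg_size_pretty(pg_indexes_size('time_series_numerical')) AS index_size,
       (SELECT COUNT(*) FROM time_series_numerical)             AS row_count;
```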
Two main factors contributed to the excessively long query time with the OP's original table structure: inefficient index scanning caused by the index column order, and excessive block reads caused by low tuple density. Querying for an individual property would be considerably more efficient if the property column appeared first in the index definition, since that would reduce the number of block reads required to access all of the related samples. Because the property names are very long, they require significantly more bytes than an integer ID reference, which means significantly more block reads (for the OP's case, an INTEGER reference would reduce the number of block reads by a factor of ~16, and a BIGINT by ~8). With the above table structure, query performance improves by several orders of magnitude, as the following demonstrates:
EXPLAIN(ANALYZE,BUFFERS,VERBOSE)
SELECT
tsn.sample_timestamp,
tsn.property_id,
p.property_name,
tsn.sample_value,
tsn.created_at,
tsn.updated_at
FROM
time_series_numerical tsn
JOIN properties p ON p.id = tsn.property_id
WHERE p.property_name = '80c2bc53040057ac1fca90fa61a492e980c2bc53040057ac1fca90fa61a492e980c2bc53040057a'
AND tsn.sample_timestamp >= '2024-09-01'::TIMESTAMP
AND tsn.sample_timestamp < '2024-09-02'::TIMESTAMP;
QUERY PLAN
Nested Loop (cost=196.53..21870.23 rows=5622 width=120) (actual time=3.191..11.325 rows=5723 loops=1)
Output: tsn.sample_timestamp, tsn.property_id, p.property_name, tsn.sample_value, tsn.created_at, tsn.updated_at
Buffers: shared hit=5768 read=2
-> Index Scan using properties_property_name_key on public.properties p (cost=0.28..8.30 rows=1 width=88) (actual time=0.637..0.641 rows=1 loops=1)
Output: p.id, p.created_at, p.updated_at, p.property_name
Index Cond: (p.property_name = '80c2bc53040057ac1fca90fa61a492e980c2bc53040057ac1fca90fa61a492e980c2bc53040057a'::text)
Buffers: shared hit=1 read=2
-> Bitmap Heap Scan on public.time_series_numerical tsn (cost=196.25..21805.71 rows=5622 width=40) (actual time=2.544..7.690 rows=5723 loops=1)
Output: tsn.sample_timestamp, tsn.property_id, tsn.created_at, tsn.updated_at, tsn.sample_value
Recheck Cond: ((tsn.property_id = p.id) AND (tsn.sample_timestamp >= '2024-09-01 00:00:00'::timestamp without time zone) AND (tsn.sample_timestamp < '2024-09-02 00:00:00'::timestamp without time zone))
Heap Blocks: exact=5723
Buffers: shared hit=5767
-> Bitmap Index Scan on time_series_numerical_pk (cost=0.00..194.84 rows=5622 width=0) (actual time=1.574..1.575 rows=5723 loops=1)
Index Cond: ((tsn.property_id = p.id) AND (tsn.sample_timestamp >= '2024-09-01 00:00:00'::timestamp without time zone) AND (tsn.sample_timestamp < '2024-09-02 00:00:00'::timestamp without time zone))
Buffers: shared hit=44
Planning:
Buffers: shared hit=78 read=11
Planning Time: 9.076 ms
Execution Time: 11.783 ms