I’m working with a PostgreSQL table that stores metric data for different assets. The table currently has over 1 billion records.
Each update will have multiple metrics, e.g., speed, distance, elevation, heading, temp. Each metric is stored as an individual row with asset_id
, timestamp
, metric_name
, and metric_value
.
I want to go back and assign a unique UUID (update_id
) to each group of metrics recorded at the same asset_id
and timestamp
. However, if a metric_name
is repeated within the same asset_id
and timestamp
, I need it to start a new update_id
.
db<>fiddle
create table example_input(
asset_id uuid
,timestamp timestamp(0)
,metric_name text
,metric_value int);
insert into example_input values
('07536cba-6e00-4b8a-be02-9509228f4bb0','2024-12-16 17:43:14','distance',100)
,('07536cba-6e00-4b8a-be02-9509228f4bb0','2024-12-16 17:43:14','speed',60)
,('07536cba-6e00-4b8a-be02-9509228f4bb0','2024-12-16 17:43:14','elevation',15)
,('07536cba-6e00-4b8a-be02-9509228f4bb0','2024-12-16 17:43:14','distance',120)
,('07536cba-6e00-4b8a-be02-9509228f4bb0','2024-12-16 17:43:14','speed',80)
,('07536cba-6e00-4b8a-be02-9509228f4bb0','2024-12-16 17:43:19','distance',140)
,('07536cba-6e00-4b8a-be02-9509228f4bb0','2024-12-16 17:43:19','elevation',20);
Expected Output:
asset_id | timestamp | metric_name | metric_value | update_id |
---|---|---|---|---|
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | distance | 100 | 6e0c8f3a-9c92-4a99-b0ff-7e6ac1cd6bf2 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | speed | 60 | 6e0c8f3a-9c92-4a99-b0ff-7e6ac1cd6bf2 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | elevation | 15 | 6e0c8f3a-9c92-4a99-b0ff-7e6ac1cd6bf2 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | distance | 120 | f9dcfa8e-95c4-498d-8a59-30233d1786b3 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | speed | 80 | f9dcfa8e-95c4-498d-8a59-30233d1786b3 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:19 | distance | 140 | 78ec74e7-a5dc-4673-99e3-ef26b76b2d42 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:19 | elevation | 20 | 78ec74e7-a5dc-4673-99e3-ef26b76b2d42 |
Due to hardware limitations, the devices only have 1 second resolution, and there are (rare) instances when the device will send more than one update in the same second. In this case, I want to detect a repeated asset_id
, timestamp
and metric_name
and have it generate a new uuid for the second update with the same second. This is the part that's giving the most grief that I can't figure out.
The uuid_generate_v3()
function from the "uuid-ossp"
extension lets you generate uuid
s in a repeatable manner. Use asset_id
as the namespace and timestamp
as the seed.
create extension if not exists "uuid-ossp";
To differentiate between multiple entries for the same metric_name
on the same timestamp
, you can add the row_number()
window function to the seed.
As a select
: demo at db<>fiddle
select*,uuid_generate_v3( asset_id
,concat(timestamp,row_number()over w))as update_id
from example_input
window w as(partition by asset_id,timestamp,metric_name
order by ctid)
As an actual update
:
alter table example_input add column update_id uuid;
update example_input as a
set update_id=uuid_generate_v3(asset_id,concat(timestamp,(
select row_number()over(order by ctid)
from example_input as c
where a.asset_id=c.asset_id
and a.timestamp=c.timestamp
and a.metric_name=c.metric_name
order by c.ctid<>a.ctid limit 1)))--only get the current row's rownum
asset_id | timestamp | metric_name | metric_value | update_id |
---|---|---|---|---|
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | distance | 100 | 7e88679b-5038-3124-b0e6-a4ed16316ba4 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | speed | 60 | 7e88679b-5038-3124-b0e6-a4ed16316ba4 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | elevation | 15 | 7e88679b-5038-3124-b0e6-a4ed16316ba4 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | distance | 120 | d1ca0905-7fa1-3f8c-9284-2d615bdaacd3 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:14 | speed | 80 | d1ca0905-7fa1-3f8c-9284-2d615bdaacd3 |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:19 | distance | 140 | 40e25bd6-55eb-34cd-a8c5-262cd9d8987b |
07536cba-6e00-4b8a-be02-9509228f4bb0 | 2024-12-16 17:43:19 | elevation | 20 | 40e25bd6-55eb-34cd-a8c5-262cd9d8987b |
The ctid
is a hidden system column holding each row's page and tuple number. You can still break the ties without it (remove order by ctid
in both cases), but a lower ctid
typically indicates an earlier entry which could make some sense in your case. The resulting uuid
obfuscates that order (earlier entry won't translate to a "lower identifier") but it pairs up Nth metrics in each second: third speed
for a given second ends up under the same update_id
as the third elevation
.