I need to compare the performance of multiple pipeline runs, like time it took to complete. I have been trying to query the event logs of the pipeline but that does not contain information for the time the pipeline took to execute.
Basically, the aim is to find a query that can find the pipeline run which ran the quickest. Any ideas if this is possible?
SELECT event_type, SUM(double(details:flow_progress.metrics.executor_time_ms)) total_executor_time_ms
FROM
event_log_raw,
latest_update
WHERE
event_type ='flow_progress'
group by event_type
Ive only managed to sum the executor_time_ms to get the total executor_time_ms. Can this be considered as the duration for the pipeline run? there surely is a better way. Please help
You need to query using Update details like update id and update_progress
.
Below is the details in portal.
Here, if you see the runtime is 27s.
First, create event_log_raw
table from the delta live tables schema.
CREATE OR REPLACE TEMP VIEW event_log_raw AS SELECT * FROM delta.`dbfs:/dlt_temp/system/events/`;
dlt_temp
is my schema where all tables and logs stored.
Next, use below query to get the total time taken by each update in the same pipeline.
SELECT
event_type, origin.update_id, max(timestamp) maxi,min(timestamp) mini, timestampdiff(SECOND,min(timestamp),max(timestamp)) total_duration_secs
FROM
event_log_raw
WHERE event_type=='update_progress'
GROUP BY event_type,origin.update_id
ORDER BY maxi DESC
Here, i filtered the records for update_progress
and done group by on each update and event type.
output;
event_type | update_id | maxi | mini | total_duration_secs |
---|---|---|---|---|
update_progress | daa08062-c10e-458e-aa8d-b5fab66de380 | 2024-04-11T09:19:48.874+00:00 | 2024-04-11T09:19:21.413+00:00 | 27 |
update_progress | e761c89d-60f9-491e-b27b-279143726178 | 2024-04-11T09:15:19.459+00:00 | 2024-04-11T09:14:45.766+00:00 | 33 |
update_progress | bfddfb90-9acd-4afc-a88f-397398fc099e | 2024-04-11T08:55:02.337+00:00 | 2024-04-11T08:54:42.075+00:00 | 20 |
update_progress | 4bea13a7-dee0-4a7e-988b-868dd74d0d4d | 2024-04-11T08:53:43.013+00:00 | 2024-04-11T08:48:42.194+00:00 | 300 |
You can also get the latest update details by giving limit 1
, since we are sorted this record on timestamp descending.
EDIT For getting time duration for the tables for specific update use below query.
SELECT
origin.flow_name, max(timestamp) maxi,min(timestamp) mini, timestampdiff(SECOND,min(timestamp),max(timestamp)) total_duration_secs
FROM
event_log_raw,
(SELECT
event_type, origin.update_id, max(timestamp) maxi,min(timestamp) mini, timestampdiff(SECOND,min(timestamp),max(timestamp)) total_duration_secs
FROM
event_log_raw
WHERE event_type=='update_progress'
GROUP BY event_type,origin.update_id
ORDER BY maxi DESC limit 1) b
WHERE
timestamp>=b.mini and timestamp<=b.maxi and origin.flow_name='target_table'
GROUP BY origin.flow_name
Here, i have used limit 1
in inner query to get latest update time frame details and used it for getting duration for table target_table