azure-databricksdatabricks-sqldatabricks-unity-catalog

Compare the run duration of multiple runs of a DLT pipeline


I need to compare the performance of multiple pipeline runs, like time it took to complete. I have been trying to query the event logs of the pipeline but that does not contain information for the time the pipeline took to execute.

Basically, the aim is to find a query that can find the pipeline run which ran the quickest. Any ideas if this is possible?

SELECT event_type, SUM(double(details:flow_progress.metrics.executor_time_ms)) total_executor_time_ms
FROM 
  event_log_raw,
  latest_update
  WHERE
  event_type ='flow_progress'
  group by event_type

Ive only managed to sum the executor_time_ms to get the total executor_time_ms. Can this be considered as the duration for the pipeline run? there surely is a better way. Please help


Solution

  • You need to query using Update details like update id and update_progress. Below is the details in portal.

    enter image description here

    Here, if you see the runtime is 27s.

    First, create event_log_raw table from the delta live tables schema.

    CREATE  OR  REPLACE TEMP VIEW event_log_raw AS  SELECT  *  FROM delta.`dbfs:/dlt_temp/system/events/`;
    

    dlt_temp is my schema where all tables and logs stored.

    Next, use below query to get the total time taken by each update in the same pipeline.

    SELECT
      event_type, origin.update_id, max(timestamp) maxi,min(timestamp) mini, timestampdiff(SECOND,min(timestamp),max(timestamp)) total_duration_secs
    FROM
      event_log_raw
    WHERE event_type=='update_progress'
    GROUP BY event_type,origin.update_id
    ORDER BY maxi DESC
    

    Here, i filtered the records for update_progress and done group by on each update and event type.

    output;

    event_type update_id maxi mini total_duration_secs
    update_progress daa08062-c10e-458e-aa8d-b5fab66de380 2024-04-11T09:19:48.874+00:00 2024-04-11T09:19:21.413+00:00 27
    update_progress e761c89d-60f9-491e-b27b-279143726178 2024-04-11T09:15:19.459+00:00 2024-04-11T09:14:45.766+00:00 33
    update_progress bfddfb90-9acd-4afc-a88f-397398fc099e 2024-04-11T08:55:02.337+00:00 2024-04-11T08:54:42.075+00:00 20
    update_progress 4bea13a7-dee0-4a7e-988b-868dd74d0d4d 2024-04-11T08:53:43.013+00:00 2024-04-11T08:48:42.194+00:00 300

    You can also get the latest update details by giving limit 1, since we are sorted this record on timestamp descending.

    enter image description here

    EDIT For getting time duration for the tables for specific update use below query.

    SELECT
    origin.flow_name, max(timestamp) maxi,min(timestamp) mini, timestampdiff(SECOND,min(timestamp),max(timestamp)) total_duration_secs
    FROM
      event_log_raw,
      (SELECT
      event_type, origin.update_id, max(timestamp) maxi,min(timestamp) mini, timestampdiff(SECOND,min(timestamp),max(timestamp)) total_duration_secs
    FROM
      event_log_raw
    WHERE event_type=='update_progress'
    GROUP BY event_type,origin.update_id
    ORDER BY maxi DESC limit 1) b
    WHERE 
    timestamp>=b.mini and timestamp<=b.maxi and origin.flow_name='target_table'
    GROUP BY origin.flow_name
    

    Here, i have used limit 1 in inner query to get latest update time frame details and used it for getting duration for table target_table