sql, performance, google-bigquery, subquery, processing-efficiency

Comparing efficiency of two queries


I am learning SQL, and while working through guided examples, I like to attempt writing my own query to answer a question before seeing how the lesson suggests doing it. In this case, I used a somewhat different approach, and I'm interested in comparing my method to the given example.

The guided example uses a public dataset available on BigQuery called bigquery-public-data.new_york_citibike. The problem posed is to find the Citibike rides whose trip durations most exceed the average trip duration for their starting station.

Here is the code suggested in the guided example:

SELECT
  starttime,
  start_station_id,
  tripduration,
  (
    -- Correlated subquery: logically evaluated once per row of outer_trips
    SELECT ROUND(AVG(tripduration), 2)
    FROM `bigquery-public-data.new_york_citibike.citibike_trips`
    WHERE start_station_id = outer_trips.start_station_id
  ) AS avg_duration_for_station,
  ROUND(tripduration - (
    SELECT AVG(tripduration)
    FROM `bigquery-public-data.new_york_citibike.citibike_trips`
    WHERE start_station_id = outer_trips.start_station_id), 2) AS difference_from_avg
FROM `bigquery-public-data.new_york_citibike.citibike_trips` AS outer_trips
ORDER BY difference_from_avg DESC
LIMIT 25

And here is the code I came up with before looking at their solution:

SELECT
  starttime,
  start_station_id,
  tripduration,
  station_averages.station_average AS station_avg,
  ROUND(tripduration - station_averages.station_average, 2) AS diff_from_station_avg
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
  JOIN (
      -- Derived table: each station's average is computed exactly once
      SELECT
        start_station_id AS station_id,
        ROUND(AVG(tripduration), 2) AS station_average
      FROM `bigquery-public-data.new_york_citibike.citibike_trips`
      GROUP BY station_id
    ) AS station_averages
  ON start_station_id = station_averages.station_id
ORDER BY 5 DESC
LIMIT 25

Looking at my code versus theirs, I expected mine to run more quickly, because I calculate the average trip duration for each station only once, when building the joined table. Their code, on the other hand, recalculates a station's average as many times as that station appears in the table.
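For what it's worth, the same result can also be written with an analytic (window) function, which computes each station's average once per partition. This is a sketch, not part of the lesson; whether BigQuery's optimizer rewrites the correlated subquery into something similar under the hood is an assumption on my part:

SELECT
  starttime,
  start_station_id,
  tripduration,
  -- The analytic AVG is computed once per station partition and reused per row
  ROUND(AVG(tripduration) OVER (PARTITION BY start_station_id), 2) AS avg_duration_for_station,
  ROUND(tripduration - AVG(tripduration) OVER (PARTITION BY start_station_id), 2) AS difference_from_avg
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
ORDER BY difference_from_avg DESC
LIMIT 25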

To compare the two approaches, I put each in a BigQuery tab, and after running them and verifying that they return the same results, I looked at the "Job Information" for each query. The first one, supplied by the course I'm taking, took "0 seconds", and mine took "1 second".

To see the comparison better, I tried editing the final line of each query to

LIMIT 250

...and then to 2,500, 25,000, and so on. Their query kept running about one second faster than mine, until I increased the limit all the way to 2.5 million; then mine ran in 17 seconds and theirs in 18 seconds.

Can anyone help me understand what's going on here? Was my reading of the code correct when I guessed that my query was more efficient? Why did the processing time scale differently for theirs and mine? And why does changing the limit affect the time at all, when the query has to compute a result for every row in the table before it can ORDER BY column 5?
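One way to separate the per-row computation cost from the cost of sorting and returning rows is a harness like this one (my own sketch, reusing the analytic form above, not from the lesson). Because it returns a single aggregated row, the LIMIT and result-serialization costs drop out of the measurement:

SELECT
  COUNT(*) AS rows_computed,
  MAX(diff_from_station_avg) AS max_diff
FROM (
  SELECT
    -- Same per-row difference as before, but never sorted or returned row by row
    tripduration - AVG(tripduration) OVER (PARTITION BY start_station_id) AS diff_from_station_avg
  FROM `bigquery-public-data.new_york_citibike.citibike_trips`
) AS diffs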

I hope these questions make sense. Thanks in advance for any insights.


Solution

  • As commented by @Baalback, when checking performance you have to look at the query plan: which indexes (if any) are used, what the cost is, and the cardinality (number of rows) at each stage. So get the query plan for both queries.

    Beyond that, making a query more efficient, or evaluating query performance in BigQuery, depends on several factors.

    Many of these factors can be observed through the query plan. A query plan is generated for every query submitted to BigQuery, and it exposes execution statistics such as bytes read and slot time consumed; a sketch of how to pull those statistics with SQL follows below.

    You can go through this link, Optimize query computation, to make your query more efficient and quicker to execute.
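For example, here is a minimal sketch for comparing the two jobs by slot time and bytes processed rather than wall-clock seconds. It assumes your jobs ran in the US multi-region; swap the `region-us` qualifier for your project's actual location:

SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 10

Slot time measured this way is a steadier signal than elapsed seconds, since elapsed time also includes queueing and returning results to the client.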