Tags: mysql, sql, refactoring, sql-tuning

How to reduce redundant MySQL function calls in a single query?


SELECT hour(datetime), COUNT(animal_id)
FROM animal_outs 
WHERE hour(datetime) > 8 AND hour(datetime) < 20
GROUP BY hour(datetime)

I am learning SQL. I am calling hour(datetime) four times in my query. I am curious 1) if this redundancy affects performance, and 2) how I can simplify this redundant code.


Solution

  • Does this affect performance?

    Probably not in any meaningful way. Query performance is usually dominated by the work of retrieving and processing rows, which is typically far more expensive than the overhead of built-in functions (although there are exceptions, such as regular expressions, which can be rather costly).
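
    If you want a rough sense of what a single hour() call costs in isolation, MySQL's BENCHMARK() function evaluates an expression repeatedly; the timestamp literal below is just a made-up value, and the elapsed time the client reports is roughly the cost of a million calls:

    SELECT BENCHMARK(1000000, hour('2024-01-15 09:30:00'));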

    MySQL allows column aliases in the GROUP BY, although not in the WHERE clause, which is evaluated before the SELECT aliases exist. So a valid "simplification" is:

    SELECT hour(datetime) as hh, COUNT(animal_id)
    FROM animal_outs 
    WHERE hour(datetime) > 8 AND hour(datetime) < 20
    GROUP BY hh;
    

    Two other versions might look simpler to you, but are likely to make things worse. The first uses HAVING:

    SELECT hour(datetime) as hh, COUNT(animal_id)
    FROM animal_outs 
    GROUP BY hh
    HAVING hh > 8 AND hh < 20
    

    Technically, this does what you want. But because it filters after the aggregation, the GROUP BY does extra work aggregating hours that are then thrown away. That extra work likely outweighs any savings from not calling hour().
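
    If you would rather measure than guess, recent MySQL versions (8.0.18+) can time both plans with EXPLAIN ANALYZE. A sketch of the comparison, reusing the table and column names from above:

    EXPLAIN ANALYZE
    SELECT hour(datetime) as hh, COUNT(animal_id)
    FROM animal_outs
    WHERE hour(datetime) > 8 AND hour(datetime) < 20
    GROUP BY hh;

    EXPLAIN ANALYZE
    SELECT hour(datetime) as hh, COUNT(animal_id)
    FROM animal_outs
    GROUP BY hh
    HAVING hh > 8 AND hh < 20;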

    Another method is a subquery:

    SELECT hh, COUNT(animal_id)
    FROM (SELECT hour(datetime) as hh, animal_id
          FROM animal_outs 
         ) ao
    WHERE hh > 8 AND hh < 20
    GROUP BY hh;
    

    In most databases, this would do what you want, and it might in the most recent versions of MySQL. However, MySQL has an irritating tendency to materialize (i.e. write to disk) subqueries in the FROM clause. That adds extra overhead -- once again, probably more than the additional calls to hour().
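
    Whether the derived table is merged away or materialized depends on your version and settings: MySQL 5.7+ exposes a derived_merge flag in optimizer_switch, and you can look at the plan to see which happened (FORMAT=TREE needs 8.0.16+). A quick check, reusing the query above:

    SELECT @@optimizer_switch;  -- look for derived_merge=on

    EXPLAIN FORMAT=TREE
    SELECT hh, COUNT(animal_id)
    FROM (SELECT hour(datetime) as hh, animal_id
          FROM animal_outs
         ) ao
    WHERE hh > 8 AND hh < 20
    GROUP BY hh;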

    Note: It is possible that hour() is a perniciously expensive function, and you might find that either of the last two solutions is faster. Also, you will probably only see an effect on performance if your data has at least a few thousand rows. Trivially small tables (a few dozen or hundred rows) are usually processed quickly regardless of such concerns.