I'm trying to help a user in another stackoverflow question and have bumped by head into a strange behaviour.
As I rarely use MariaDB, I'm opening another question to investigate the behaviour,
Simplified schema...
CREATE TABLE source (
id INT,
some_data BIGINT,
PRIMARY KEY (id)
);
CREATE TABLE lookup (
id BIGINT,
a_value NVARCHAR(32),
PRIMARY KEY (id)
);
Simplified query...
SELECT
*
FROM
source AS s
INNER JOIN
lookup AS l
ON l.id = some_function(s.some_data)
Apparently, due to the function call, mariadb chooses Using join buffer (flat, BNL join)
If I push the function results into a holding table, mariadb chooses a simple join, and is 1000x faster.
CREATE TABLE interim AS
SELECT
s.*,
some_function(s.some_data) AS lookup_id
FROM
source AS s
;
SELECT
*
FROM
interim AS i
INNER JOIN
lookup AS l
ON l.id = i.lookup_id
;
(An interim table doesn't seem sensible, and all that's needed is a boring hash join.)
Expanded fiddle with random data, user-defined-function, crude timings, etc...
I suggest considering the DETERMINISTIC modifier when defining a stored function.
According to the conditions of the task, the function can be considered DETERMINISTIC, since we do not expect the data in the table to change after the query is started. The data is represented by an immutable log.
A function is deterministic if it can produce only one result for a given list of parameters. If the result may be affected by stored data, server variables, random numbers or any value that is not explicitly passed, then the function is not deterministic. Also, a function is non-deterministic if it uses nondeterministic functions like NOW() or CURRENT_TIMESTAMP().
The optimizer may choose a faster execution plan if it known that the function is deterministic. In such cases, you should declare the routine using the DETERMINISTIC keyword. If you want to explicitly state that the function is not deterministic (which is the default) you can use the NOT DETERMINISTIC keywords.
If you declare a non-deterministic function as DETERMINISTIC, you may get incorrect results. If you declare a deterministic function as NOT DETERMINISTIC, in some cases the queries will be slower.
We can see difference on model. Create 2 functions
-- NON DETERMINISTIC function (by default - NON DETERMINISTINC)
CREATE FUNCTION get_ip_range(v_ip BIGINT) RETURNS BIGINT
RETURN
(
SELECT
CASE WHEN ips.ip4_end >= v_ip THEN ips.ip4_beg ELSE -1 END
FROM
ips
WHERE
ip4_beg <= v_ip
ORDER BY
ip4_beg DESC
LIMIT
1
);
-- DETERMINISTIC function - this is only difference
CREATE FUNCTION get_ip_range2(v_ip BIGINT) RETURNS BIGINT DETERMINISTIC
RETURN
(
SELECT
CASE WHEN ips.ip4_end >= v_ip THEN ips.ip4_beg ELSE -1 END
FROM
ips
WHERE
ip4_beg <= v_ip
ORDER BY
ip4_beg DESC
LIMIT
1
);
Run test query an compare plan's
-- using DETERMINISTIC function
SET @start = current_timestamp(6);
ANALYZE
SELECT
lo.id_member,
lo.ip,
ip.ip4_beg,
ip.country
FROM
ips AS ip
INNER JOIN
log_online AS lo
ON ip.ip4_beg = get_ip_range2(lo.ip)
ORDER BY
lo.id_member
;
SELECT TIMESTAMPDIFF(microsecond, @start, current_timestamp(6)) / 1000.0 AS elapsed;
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | lo | index | null | PRIMARY | 4 | null | 100 | 100.00 | 100.00 | 100.00 | |
| 1 | SIMPLE | ip | eq_ref | PRIMARY,reversed | PRIMARY | 8 | func | 1 | 1.00 | 100.00 | 100.00 | Using where |
| elapsed |
|---|
| 37.3400 |
-- SLOW
-- using NON DETERMINISTIC function (by default - NON DETERMINISTINC
SET @start = current_timestamp(6);
ANALYZE
SELECT
lo.id_member,
lo.ip,
ip.ip4_beg,
ip.country
FROM
ips AS ip
INNER JOIN
log_online AS lo
ON ip.ip4_beg = get_ip_range(lo.ip)
ORDER BY
lo.id_member
;
SELECT TIMESTAMPDIFF(microsecond, @start, current_timestamp(6)) / 1000.0 AS elapsed;
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | lo | ALL | null | null | null | null | 100 | 100.00 | 100.00 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | ip | index | null | reversed | 22 | null | 1602 | 1602.00 | 100.00 | 0.06 | Using where; Using index; Using join buffer (flat, BNL join) |
| elapsed |
|---|
| 11897.9650 |
Update1 I tried to see how many times the get_ip_range() function is called.
With a table "ips" size 802 rows and "log_online" of 100 rows:
For a deterministic function, the call is made ~2 times for each row of the log_online table, or a total of 2*100=200 times.
For a non-deterministic function, the call is made 802 times for each row, or a total of 100*802=80200 times .
fiddle