I am using MariaDB v10.11.15 (a MySQL fork) which does not support the LATERAL keyword, to lookup the countries of 30 IPv4 numbers. I have the following query that retrieves the country when given an IPv4 number (in decimal form), for example:
SELECT
ips.country
FROM (
SELECT *
FROM ips
WHERE ip4_beg <= 36843009
ORDER BY ip4_beg DESC
LIMIT 1
) AS ips
WHERE ips.ip4_end >= 36843009;
The table ips contains millions of IPv4 ranges defined by the columns ip4_beg and ip4_end (inclusively). The ranges are unique and non-overlapping but there are gaps between the ranges, so it can happen that a given IPv4 number is not found.
There is a UNIQUE index on the column ips.ip4_beg.
HOW IT WORKS: The inner query finds the first start of range (ip4_beg) that is =< than the sought after IP number. The outer query only checks whether the end of this range (ip4_end) is >= than the sought after IP number. Due to the nature of the data it makes no sense for the inner query to return more than 1 row. Also if the condition in the WHERE clause fails in the outer query, then it makes no sense for the inner query to continue searching due to the nature of the data.
The query performs very well as illustrated by the following output of the ANALYZE command:
id select_type table type p_keys key keylen ref rows r_rows filtered r_filtered Extra
-------------------------------------------------------------------------------------------------------------------------
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2 1.00 100.00 100.00 Using where
2 DERIVED ips range idx_beg idx_beg 4 NULL 1148647 1.00 100.00 100.00 Using where
r_rows is the number of rows that were actually processed.
In practical terms, the query above looks up the country of an IP number in less than 1ms with an index only on ips.ip4_beg.
I have a second query that simply retrieves 30 IPv4 numbers as well as id_member and url. It is as follows:
SELECT
id_member, url, ip
FROM log_online
ORDER BY id_member
LIMIT 0, 30;
There is a UNIQUE index on the column log_online.id_member
The type of log_online.ip is the same as the type of ips.ip4_beg and ips.ip4_end.
Q: How to execute the first query that retrieves the ips.country for every log_online.ip number returned by the second query (i.e.: for the 30 ip numbers that it returns). The final output must contain:
log_online.id_member,
log_online.url,
log_online.ip,
ips.country
I see no reason for the combined queries to have the execution time longer than: second_query + first_query * 30, because the second_query returns 30 rows, so I will not accept answers with queries that have a longer execution time.
I tried using the first query as a correlated subquery in the SELECT list of the second query but I could not reference the log_online.ip or lo.ip inside it:
SELECT
id_member, url, ip,
(SELECT
innerq.country
FROM (
SELECT *
FROM ips
WHERE ip4_beg <= lo.ip -- it doesn't see lo.ip here
ORDER BY ip4_beg DESC
LIMIT 1
) AS innerq
WHERE innerq.ip4_end >= lo.ip
)
FROM log_online AS lo
ORDER BY id_member
LIMIT 0, 30;
I got a question in the comments why don't I try the following:
SELECT
id_member, url, ip,
(
SELECT
country
FROM ips
WHERE lo.ip BETWEEN ip4_beg AND ip4_end
)
FROM log_online AS lo
ORDER BY id_member
LIMIT 0, 30;
The output of the ANALYZE command speaks for itself:
id select_type table type p_keys key keylen ref rows r_rows filtered r_filtered Extra
--------------------------------------------------------------------------------------------------------------------------
1 PRIMARY lo index NULL id_member 3 NULL 531 30.00 100.00 100.00
2 DEP SUBQUERY ips ALL idx_beg NULL NULL NULL 2418126 2418278.00 100.00 0.00 Using where
Adding an ORDER BY clause and LIMIT 1 improves performance marginally:
ANALYZE SELECT
id_member, url, ip,
(
SELECT
country
FROM ips
WHERE lo.ip BETWEEN ip4_beg AND ip4_end
ORDER BY ip4_beg DESC
LIMIT 1
)
FROM log_online AS lo
ORDER BY id_member
LIMIT 0, 30;
Again, the output of the ANALYZE command speaks for itself:
id select_type table type p_keys key keylen ref rows r_rows filtered r_filtered Extra
---------------------------------------------------------------------------------------------------------------------------
1 PRIMARY lo index NULL id_member 3 NULL 410 30.00 100.00 100.00
2 DEP SUBQUERY ips index idx_beg idx_beg 4 NULL 2418126 879673.03 100.00 0.00 Using where
In practical terms, the query above takes ~4 minutes to look up the countries of only 30 IPs ...Ouch !!!
P.S.
In the future, I am planning to return more columns from the ips table ( from the same row !!! ) - most likely using the JSON_OBJECT() so a separate subquery is not run for multiple columns in the same row.
EDIT: Older suggestions removed.
The following fiddle shows my working steps : https://dbfiddle.uk/u6k1SqY3
First, I modified the logic very slightly
AND ip4_end >= ip filter with a CASE expressionThen I timed the lookup of a country for a single ip address.
single address in variable => 0.5ms
single address in table => 1.0ms
as correlated sub-query => 60.0ms
I tried numerous approaches, several still in the demo above, but I could never prevent the optimiser "being less than optimal" when processing a table of address.
So, I embedded the single-address lookup in a loop, then a cursor. It was significantly better
cursor over 100 addresses => 35ms
Then, I wrapped the single-address lookup in a scalar valued user defined function
function over 100 addresses => 8.5ms
Note: All of these tests also had an index on (ip4_beg DESC, ip4_end, country)
So, my conclusions are:
CASE expression allows one layer of the sub-query to be removed (so it compiles)CREATE FUNCTION get_country(v_ip BIGINT) RETURNS INT
RETURN
(
SELECT
CASE WHEN ips.ip4_end >= v_ip THEN ips.country END
FROM
ips
WHERE
ip4_beg <= v_ip
ORDER BY
ip4_beg DESC
LIMIT
1
);
SELECT
id_member, ip, get_country(ip) AS country
FROM
log_online AS lo
ORDER BY
id_member
;