sqlmariadbmariadb-10.11

Correlated queries without the LATERAL keyword


I am using MariaDB v10.11.15 (a MySQL fork) which does not support the LATERAL keyword, to lookup the countries of 30 IPv4 numbers. I have the following query that retrieves the country when given an IPv4 number (in decimal form), for example:

SELECT
    ips.country
FROM (
    SELECT *
    FROM ips 
    WHERE ip4_beg <= 36843009
    ORDER BY ip4_beg DESC
    LIMIT 1
    ) AS ips
WHERE ips.ip4_end >= 36843009;

The table ips contains millions of IPv4 ranges defined by the columns ip4_beg and ip4_end (inclusively). The ranges are unique and non-overlapping but there are gaps between the ranges, so it can happen that a given IPv4 number is not found.

There is a UNIQUE index on the column ips.ip4_beg.

HOW IT WORKS: The inner query finds the first start of range (ip4_beg) that is =< than the sought after IP number. The outer query only checks whether the end of this range (ip4_end) is >= than the sought after IP number. Due to the nature of the data it makes no sense for the inner query to return more than 1 row. Also if the condition in the WHERE clause fails in the outer query, then it makes no sense for the inner query to continue searching due to the nature of the data.

The query performs very well as illustrated by the following output of the ANALYZE command:

id  select_type table       type      p_keys   key      keylen  ref   rows     r_rows  filtered r_filtered  Extra
-------------------------------------------------------------------------------------------------------------------------
1   PRIMARY     <derived2>  ALL       NULL     NULL     NULL    NULL  2        1.00    100.00   100.00      Using where 
2   DERIVED     ips         range     idx_beg  idx_beg  4       NULL  1148647  1.00    100.00   100.00      Using where

r_rows is the number of rows that were actually processed.

In practical terms, the query above looks up the country of an IP number in less than 1ms with an index only on ips.ip4_beg.

I have a second query that simply retrieves 30 IPv4 numbers as well as id_member and url. It is as follows:

SELECT 
     id_member, url, ip
FROM log_online
ORDER BY id_member
LIMIT 0, 30;

There is a UNIQUE index on the column log_online.id_member The type of log_online.ip is the same as the type of ips.ip4_beg and ips.ip4_end.

Q: How to execute the first query that retrieves the ips.country for every log_online.ip number returned by the second query (i.e.: for the 30 ip numbers that it returns). The final output must contain: log_online.id_member, log_online.url, log_online.ip, ips.country

I see no reason for the combined queries to have the execution time longer than: second_query + first_query * 30, because the second_query returns 30 rows, so I will not accept answers with queries that have a longer execution time.

I tried using the first query as a correlated subquery in the SELECT list of the second query but I could not reference the log_online.ip or lo.ip inside it:

SELECT 
     id_member, url, ip,
     (SELECT
        innerq.country
      FROM (
             SELECT *
             FROM ips 
             WHERE ip4_beg <= lo.ip   -- it doesn't see lo.ip here
             ORDER BY ip4_beg DESC
             LIMIT 1
           ) AS innerq
      WHERE innerq.ip4_end >= lo.ip
     )
FROM log_online AS lo
ORDER BY id_member
LIMIT 0, 30;

I got a question in the comments why don't I try the following:

SELECT 
     id_member, url, ip, 
     (
      SELECT
         country
      FROM ips 
      WHERE lo.ip BETWEEN ip4_beg AND ip4_end
     )
FROM log_online AS lo
ORDER BY id_member
LIMIT 0, 30;

The output of the ANALYZE command speaks for itself:

id  select_type   table  type   p_keys   key        keylen  ref   rows     r_rows      filtered  r_filtered  Extra
--------------------------------------------------------------------------------------------------------------------------
1   PRIMARY       lo     index  NULL     id_member  3       NULL  531      30.00       100.00    100.00         
2   DEP SUBQUERY  ips    ALL    idx_beg  NULL       NULL    NULL  2418126  2418278.00  100.00      0.00      Using where  

Adding an ORDER BY clause and LIMIT 1 improves performance marginally:

ANALYZE SELECT 
     id_member, url, ip, 
     (
      SELECT
         country
      FROM ips 
      WHERE lo.ip BETWEEN ip4_beg AND ip4_end
      ORDER BY ip4_beg DESC
      LIMIT 1
     )
FROM log_online AS lo
ORDER BY id_member
LIMIT 0, 30;

Again, the output of the ANALYZE command speaks for itself:

id  select_type    table  type    p_keys   key        keylen  ref   rows     r_rows     filtered  r_filtered  Extra
---------------------------------------------------------------------------------------------------------------------------
1   PRIMARY        lo     index   NULL     id_member  3       NULL  410      30.00      100.00    100.00         
2   DEP SUBQUERY   ips    index   idx_beg  idx_beg    4       NULL  2418126  879673.03  100.00      0.00      Using where    

In practical terms, the query above takes ~4 minutes to look up the countries of only 30 IPs ...Ouch !!!

P.S.
In the future, I am planning to return more columns from the ips table ( from the same row !!! ) - most likely using the JSON_OBJECT() so a separate subquery is not run for multiple columns in the same row.


Solution

  • EDIT: Older suggestions removed.

    The following fiddle shows my working steps : https://dbfiddle.uk/u6k1SqY3


    First, I modified the logic very slightly


    Then I timed the lookup of a country for a single ip address.

    single address in variable =>  0.5ms
    single address in table    =>  1.0ms
    as correlated sub-query    => 60.0ms
    

    I tried numerous approaches, several still in the demo above, but I could never prevent the optimiser "being less than optimal" when processing a table of address.

    So, I embedded the single-address lookup in a loop, then a cursor. It was significantly better

    cursor over 100 addresses => 35ms
    

    Then, I wrapped the single-address lookup in a scalar valued user defined function

    function over 100 addresses => 8.5ms
    

    Note: All of these tests also had an index on (ip4_beg DESC, ip4_end, country)


    So, my conclusions are:


    CREATE FUNCTION get_country(v_ip BIGINT) RETURNS INT
    RETURN
    (
        SELECT
            CASE WHEN ips.ip4_end >= v_ip THEN ips.country END
        FROM
            ips 
        WHERE
            ip4_beg <= v_ip
        ORDER BY
            ip4_beg DESC
        LIMIT
            1            
    );
    
    SELECT
      id_member, ip, get_country(ip) AS country
    FROM
      log_online AS lo
    ORDER BY
      id_member
    ;