performancegoogle-bigquerygeoip

How to improve performance of GeoIP query in BigQuery?


I have loaded my application logs in BigQuery and I need to calculate country based on IP address from those logs.

I have written a join query between my table and a GeoIP mapping table that I downloaded from MaxMind.

An ideal query would be OUTER JOIN with range filter, however BQ supports only = in join conditions. So the query does an INNER JOIN and handles missing values in each side of the JOIN.

I have amended my original query so it could run on the Wikipedia public data set.

Can someone please help me make this run faster?

SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name

FROM
    (SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
    FROM [publicdata:samples.wikipedia] Limit 1000) AS A1

JOIN 
    (SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
    FROM

        -- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
        -- all Ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])

        -- Missing rages lower from From_IP 
        ,(SELECT
            PriorRangeEndIP + 1 From_IP_Code, 
            From_IP_Code - 1 AS To_IP_Code, 
            'NA' AS Country_Name
        FROM

            -- use of LAG function to find prior valid range
            (SELECT 
                From_IP_Code, 
                To_IP_Code, Country_Name, 
                LAG(To_IP_Code, 1, INTEGER(0)) 
                OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP                 
            FROM [QA_DATASET.GeoIP]) A

            -- If gap from prior valid range is > 1 than its a gap to fill
            WHERE From_IP_Code > PriorRangeEndIP + 1)

        -- Missing rages higher tan Max To_IP
        ,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
        FROM [QA_DATASET.GeoIP])
    ) AS B
ON A1.ONE = B.ONE    -- fake join condition to overcome allowed use of only = in joins

-- Join condition where valid IP exists on left
WHERE
    A1.client_ip_code >= B.From_IP_Code
    AND A1.client_ip_code <= B.To_IP_Code
    OR (A1.client_ip_code IS NULL 
    AND B.From_IP_Code = 1)    -- where there is no valid IP on left contributor_ip

Solution

  • 2019, much improved answer:

    #standardSQL
    # replace with your source of IP addresses
    # here I'm using the same Wikipedia set from the previous article
    WITH source_of_ip_addresses AS (
      SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
      FROM `publicdata.samples.wikipedia`
      WHERE contributor_ip IS NOT null  
      GROUP BY 1
    )
    SELECT country_name, SUM(c) c
    FROM (
      SELECT ip, country_name, c
      FROM (
        SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
        FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
        WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
      )
      JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`  
      USING (network_bin, mask)
    )
    GROUP BY 1
    ORDER BY 2 DESC
    

    Cleaned up version of this answer at: http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

    Let me tidy the original query:

    SELECT
      id,
      client_ip,
      client_ip_code,
      B.Country_Name AS Country_Name
    FROM (
      SELECT
        id,
        contributor_ip AS  client_ip,
        INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
        1 AS One
      FROM
        [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT
        1000
        ) AS A1
    LEFT JOIN
      (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        1 AS One
      FROM
        --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
        (
        SELECT
          From_IP_Code,
          To_IP_Code,
          Country_Name
        FROM
          [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
        ,
        (
        SELECT
          PriorRangeEndIP+1 From_IP_Code,
          From_IP_Code-1 AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
        from(
          SELECT
            From_IP_Code,
            To_IP_Code,
            Country_Name
            ,
            LAG(To_IP_Code,
              1,
              INTEGER(0)) OVER(
            ORDER BY
              From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
          FROM
            [playscape-proj:GeoIP.GeoIP])A
        WHERE
         From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
          ,
        (
        SELECT
          MAX(To_IP_Code)+1 AS From_IP_Code,
          INTEGER (4311810304) AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages higher tan Max To_IP
        FROM
          [playscape-proj:GeoIP.GeoIP])
        ) AS B
      ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
    WHERE
      A1.client_ip_code>=B.From_IP_Code
      AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
      OR (A1.client_ip_code IS NULL
        AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;
    

    That's a long query! (and a very interesting one). It runs in 14 seconds. How can we optimize it?

    Some tricks I found:

    So I'm changing:

    And now it runs in 3 seconds! 5% of the ips could not be geolocated, probably by the described gaps (easy fix).

    Now, how about going from the LIMIT 1000 to LIMIT 300000. How long will it take?

    37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right side table into a static one - as once computed it doesn't change at all, it's just an expansion of the basic rules. Then you can use JOIN EACH.

    SELECT
      id,
      client_ip,
      client_ip_code,
      B.Country_Name AS Country_Name
    FROM (
      SELECT
        id,
        contributor_ip AS  client_ip,
        INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
        INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
      FROM
        [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT
        300000
        ) AS A1
    JOIN 
      (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        INTEGER(From_IP_Code/(256*256*256)) AS One
      FROM
        --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
        (
        SELECT
          From_IP_Code,
          To_IP_Code,
          Country_Name
        FROM
          [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
        ,
        (
        SELECT
          PriorRangeEndIP+1 From_IP_Code,
          From_IP_Code-1 AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
        from(
          SELECT
            From_IP_Code,
            To_IP_Code,
            Country_Name
            ,
            LAG(To_IP_Code,
              1,
              INTEGER(0)) OVER(
            ORDER BY
              From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
          FROM
            [playscape-proj:GeoIP.GeoIP])A
        WHERE
         From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
          ,
        (
        SELECT
          MAX(To_IP_Code)+1 AS From_IP_Code,
          INTEGER (4311810304) AS To_IP_Code,
          'NA' AS Country_Name -- Missing rages higher tan Max To_IP
        FROM
          [playscape-proj:GeoIP.GeoIP])
        ) AS B
      ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
    WHERE
      A1.client_ip_code>=B.From_IP_Code
      AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
      OR (A1.client_ip_code IS NULL
        AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;