How to improve performance of GeoIP query in BigQuery?
Asked Answered
S

3

11

I have loaded my application logs in BigQuery and I need to calculate country based on IP address from those logs.

I have written a join query between my table and a GeoIP mapping table that I downloaded from MaxMind.

An ideal query would be OUTER JOIN with range filter, however BQ supports only = in join conditions. So the query does an INNER JOIN and handles missing values in each side of the JOIN.

I have amended my original query so it could run on the Wikipedia public data set.

Can someone please help me make this run faster?

SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name

FROM
    (SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
    FROM [publicdata:samples.wikipedia] Limit 1000) AS A1

JOIN 
    (SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
    FROM

        -- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
        -- all Ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])

        -- Missing rages lower from From_IP 
        ,(SELECT
            PriorRangeEndIP + 1 From_IP_Code, 
            From_IP_Code - 1 AS To_IP_Code, 
            'NA' AS Country_Name
        FROM

            -- use of LAG function to find prior valid range
            (SELECT 
                From_IP_Code, 
                To_IP_Code, Country_Name, 
                LAG(To_IP_Code, 1, INTEGER(0)) 
                OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP                 
            FROM [QA_DATASET.GeoIP]) A

            -- If gap from prior valid range is > 1 than its a gap to fill
            WHERE From_IP_Code > PriorRangeEndIP + 1)

        -- Missing rages higher tan Max To_IP
        ,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
        FROM [QA_DATASET.GeoIP])
    ) AS B
ON A1.ONE = B.ONE    -- fake join condition to overcome allowed use of only = in joins

-- Join condition where valid IP exists on left
WHERE
    A1.client_ip_code >= B.From_IP_Code
    AND A1.client_ip_code <= B.To_IP_Code
    OR (A1.client_ip_code IS NULL 
    AND B.From_IP_Code = 1)    -- where there is no valid IP on left contributor_ip
Stratton answered 27/10, 2013 at 12:37 Comment(2)
that looks cool! looking into it.. could you put the GeoIP table in a new dataset and make it public (to try queries out). Thanks!Lemuellemuela
I've been playing with this. Thanks for making the table public. The hardest part so far has been dealing with the nulls. If nulls could be skipped, it can be done much faster - starting with the query you wrote.Lemuellemuela
L
17

2019, much improved answer:

#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT null  
  GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`  
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC

Cleaned up version of this answer at: http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

Let me tidy the original query:

SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS  client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    1 AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    1000
    ) AS A1
LEFT JOIN
  (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    1 AS One
  FROM
    --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
    from(
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name
        ,
        LAG(To_IP_Code,
          1,
          INTEGER(0)) OVER(
        ORDER BY
          From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP])A
    WHERE
     From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
      ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER (4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages higher tan Max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
    ) AS B
  ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;

That's a long query! (and a very interesting one). It runs in 14 seconds. How can we optimize it?

Some tricks I found:

  • Skip NULLs. If there is no ip address in a log, don't try to match it.
  • Reduce the combinations. Instead of JOINing every left side record with every right side record, how about joining only the 39.x.x.x records on the left side with the 39.x.x.x records on the right side. There are only a few (3 or 4) rules that cover multiple ranges. It would be easy to add a couple of rules on the geolite table to add rules to cover these gaps.

So I'm changing:

  • 1 AS One to INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One (twice).
  • Adding a 'WHERE contributor_ip IS NOT NULL`.

And now it runs in 3 seconds! 5% of the ips could not be geolocated, probably by the described gaps (easy fix).

Now, how about going from the LIMIT 1000 to LIMIT 300000. How long will it take?

37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right side table into a static one - as once computed it doesn't change at all, it's just an expansion of the basic rules. Then you can use JOIN EACH.

SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS  client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    300000
    ) AS A1
JOIN 
  (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    INTEGER(From_IP_Code/(256*256*256)) AS One
  FROM
    --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
    from(
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name
        ,
        LAG(To_IP_Code,
          1,
          INTEGER(0)) OVER(
        ORDER BY
          From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP])A
    WHERE
     From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
      ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER (4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages higher tan Max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
    ) AS B
  ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;
Lemuellemuela answered 22/11, 2013 at 23:47 Comment(3)
Thank you for your comprehensive answer! .1. I would put the right side in a static table. 2. I am afraid that "From_IP_Code/(256*256*256) as One" wont work fine as there are some ranges that spread between 2 or even 3 values. 3. skipping nulls in the join is tricky as it might cause dropping records in the final data set (but this can be handled with a wise join condition). I read that you guys are about to add some flexibility to join conditions soon. I hope it will simplify such queries.Stratton
Yes: the issue of the range that spread 2 or 3 values is easy to solve. Fortunately there are only around 5 of these cases. The easiest fix would be to add these few rules to cover the gap. For example, if a range goes from 4.0.0.0 to 5.255.255.255, you could duplicate it to have 4.0.0.0 to 4.255.255.255 and 5.0.0.0 to 5.255.255.255. There are only few cases to cover, and for this low investment you get much more faster queries later on.Lemuellemuela
For anyone using standard SQL, Net functions exist to replace legacy functions like PARSE_IP().Chemulpo
L
7

As a cool addendum (see previous answer to get the details): What are the top countries contributing edits to Wikipedia?

Row Country_Name    c    
1   United States   36605405     
2   United Kingdom  10355936     
3   Canada          4988835  
4   Australia       3387582  
5   India           1447756  
6   Germany         1414713 
7   Philippines     765874   
8   Netherlands     668850   
9   Ireland         651370   
10  France          602113   
11  New Zealand     590554   
12  Sweden          556544
....
Query complete (28.5s elapsed, 1.07 GB processed)

Query:

SELECT Country_Name, COUNT(*) c
FROM (
SELECT
 id,
 client_ip,
 client_ip_code,
 B.Country_Name AS Country_Name
FROM (
 SELECT
   id,
   contributor_ip AS client_ip,
   INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
   INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
 FROM
   [publicdata:samples.wikipedia]
 WHERE contributor_ip IS NOT NULL
 -- NO LIMITS - use ALL the data!
   ) AS A1
JOIN
 (
 SELECT
   From_IP_Code,
   To_IP_Code,
   Country_Name,
   INTEGER(From_IP_Code/(256*256*256)) AS One
 FROM
   --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
   (
   SELECT
     From_IP_Code,
     To_IP_Code,
     Country_Name
   FROM
     [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
   ,
   (
   SELECT
     PriorRangeEndIP+1 From_IP_Code,
     From_IP_Code-1 AS To_IP_Code,
     'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
   from(
     SELECT
       From_IP_Code,
       To_IP_Code,
       Country_Name,
       LAG(To_IP_Code,
         1,
         INTEGER(0)) OVER(
       ORDER BY
         From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
     FROM
       [playscape-proj:GeoIP.GeoIP])A
   WHERE
    From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
     ,
   (
   SELECT
     MAX(To_IP_Code)+1 AS From_IP_Code,
     INTEGER (4311810304) AS To_IP_Code,
     'NA' AS Country_Name -- Missing rages higher tan Max To_IP
   FROM
     [playscape-proj:GeoIP.GeoIP])
   ) AS B
 ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
 A1.client_ip_code>=B.From_IP_Code
 AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
 OR (A1.client_ip_code IS NULL
   AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;
)
GROUP BY 1 ORDER BY 2 DESC
Lemuellemuela answered 23/11, 2013 at 0:10 Comment(4)
This is great example of using ip's with BQ. I am not able to see the table [playscape-proj:GeoIP.GeoIP] in public datasets. I wanted to run the above queries. Can you please guide me. ThanksUplift
see the query at cloudplatform.googleblog.com/2014/03/…Lemuellemuela
May i know what is the accuracy of this GeoIP conversion to city, longitude-latitude. Can we rely on it or is it approximation of the location based on IP.Uplift
Is it possible to do the same things with most recent data which is available in CSV format now - Quoting you - "There is a newer database available too, but I didn't work with it as it's only available in binary form for now. " ThanksUplift
I
1

Inspired by https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 I came up with this solution that handles the ipv6 addresses also:

WITH test_data AS (
    SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL 
    SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
        FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v4`
    USING (network_bin, mask)
), ipv6 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT  ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
        FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v6`  
    USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL 
SELECT * FROM ipv6d

In order to get the geoip_city_v4 and geoip_city_v6 you need to download the geoip database from https://maxmind.com/

For a complete solution you can follow this tutorial in to update and prepare you dataset https://hodo.dev/posts/post-37-gcp-bigquery-geoip/.

Indicative answered 5/2, 2022 at 18:23 Comment(1)
The corect link to the article is hodo.dev/posts/post-37-gcp-bigquery-geoipIndicative

© 2022 - 2024 — McMap. All rights reserved.