GeoIP table join with table of IP's in MySQL
Asked Answered
L

4

6

I am having a issue finding a fast way of joining the tables looking like that:

mysql> explain geo_ip;
+--------------+------------------+------+-----+---------+-------+
| Field        | Type             | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| ip_start     | varchar(32)      | NO   |     | ""      |       |
| ip_end       | varchar(32)      | NO   |     | ""      |       |
| ip_num_start | int(64) unsigned | NO   | PRI | 0       |       |
| ip_num_end   | int(64) unsigned | NO   |     | 0       |       |
| country_code | varchar(3)       | NO   |     | ""      |       |
| country_name | varchar(64)      | NO   |     | ""      |       |
| ip_poly      | geometry         | NO   | MUL | NULL    |       |
+--------------+------------------+------+-----+---------+-------+


mysql> explain entity_ip;
+------------+---------------------+------+-----+---------+-------+
| Field      | Type                | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+-------+
| entity_id  | int(64) unsigned    | NO   | PRI | NULL    |       |
| ip_1       | tinyint(3) unsigned | NO   |     | NULL    |       |
| ip_2       | tinyint(3) unsigned | NO   |     | NULL    |       |
| ip_3       | tinyint(3) unsigned | NO   |     | NULL    |       |
| ip_4       | tinyint(3) unsigned | NO   |     | NULL    |       |
| ip_num     | int(64) unsigned    | NO   |     | 0       |       |
| ip_poly    | geometry            | NO   | MUL | NULL    |       |
+------------+---------------------+------+-----+---------+-------+

Please note that I am not interested in finding the needed rows in geo_ip by only ONE IP address at once, I need a entity_ip LEFT JOIN geo_ip (or similar/analogue way).

This is what I have for now (using polygons as advised on http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/):

mysql> EXPLAIN SELECT li.*, gi.country_code FROM entity_ip AS li
-> LEFT JOIN geo_ip AS gi ON
-> MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`);

+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
|  1 | SIMPLE      | li    | ALL  | NULL          | NULL | NULL    | NULL |   2470 |       |
|  1 | SIMPLE      | gi    | ALL  | ip_poly_index | NULL | NULL    | NULL | 155183 |       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+

mysql> SELECT li.*, gi.country_code FROM entity AS li LEFT JOIN geo_ip AS gi ON MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`) limit  0, 20;
20 rows in set (2.22 sec)

No polygons

mysql> explain SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.`ip_num` >= gi.`ip_num_start` AND li.`ip_num` <= gi.`ip_num_end` LIMIT 0,20;
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys             | key  | key_len | ref  | rows   | Extra |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
|  1 | SIMPLE      | li    | ALL  | NULL                      | NULL | NULL    | NULL |   2470 |       |
|  1 | SIMPLE      | gi    | ALL  | PRIMARY,geo_ip,geo_ip_end | NULL | NULL    | NULL | 155183 |       |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+

mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.ip_num BETWEEN gi.ip_num_start AND gi.ip_num_end limit  0, 20;
20 rows in set (2.00 sec)

(On higher number of rows in the search - there is no difference)

Currently I cannot get any faster performance from these queries as 0.1 seconds per IP is way too slow for me.

Is there any way to make it faster?

Louettalough answered 19/11, 2011 at 16:10 Comment(7)
Shot in the dark: any chance that an index on entity_ip's ip_num would improve the speed of the second query?Setup
A must to do it inside MySQL? If we treat ip_num_start and ip_num_end as associated points, and reading entity_ip.ip_num in sorted manner as x-coord of a sweeping line across the dots, the concept of sweep line algorithm may give you a faster run than the n-by-m left join inside MySQL.Lobelia
Don't know about author's case, for me (and many people) it would be very interesting to see mysql only solution.Cassey
First eveident fix is to use POINT insead of GEOMETRY datatype in entity_ip.ip_poly. Next: mysql may not be using indexes in your query, because you select all rows from entity_ip and the query optimizer decides to make fullscan.Cassey
Oroboros102, POINT does not work, I was starting with POINT, actually :)Louettalough
Robert Martin, unfortunately, index is already there :)Louettalough
@AndrejCizov what kind of geometry is in entity_ip table?Cassey
P
6

This approach has some scalability issues (should you choose to move to, say, city-specific geoip data), but for the given size of data, it will provide considerable optimization.

The problem you are facing is effectively that MySQL does not optimize range-based queries very well. Ideally you want to do an exact ("=") look-up on an index rather than "greater than", so we'll need to build an index like that from the data you have available. This way MySQL will have much fewer rows to evaluate while looking for a match.

To do this, I suggest that you create a look-up table that indexes the geolocation table based on the first octet (=1 from 1.2.3.4) of the IP addresses. The idea is that for each look-up you have to do, you can ignore all geolocation IPs which do not begin with the same octet than the IP you are looking for.

CREATE TABLE `ip_geolocation_lookup` (
  `first_octet` int(10) unsigned NOT NULL DEFAULT '0',
  `ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
  `ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Next, we need to take the data available in your geolocation table and produce data that covers all (first) octets the geolocation row covers: If you have an entry with ip_start = '5.3.0.0' and ip_end = '8.16.0.0', the lookup table will need rows for octets 5, 6, 7, and 8. So...

ip_geolocation
|ip_start       |ip_end          |ip_numeric_start|ip_numeric_end|
|72.255.119.248 |74.3.127.255    |1224701944      |1241743359    |

Should convert to:

ip_geolocation_lookup
|first_octet|ip_numeric_start|ip_numeric_end|
|72         |1224701944      |1241743359    |
|73         |1224701944      |1241743359    |
|74         |1224701944      |1241743359    |

Since someone here requested for a native MySQL solution, here's a stored procedure that will generate that data for you:

DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;

CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
    DECLARE i INT DEFAULT 0;

    DELETE FROM ip_geolocation_lookup;

    WHILE i < 256 DO
       INSERT INTO ip_geolocation_lookup (first_octet, ip_numeric_start, ip_numeric_end) 
                SELECT  i, ip_numeric_start, ip_numeric_end FROM ip_geolocation WHERE 
                ( ip_numeric_start & 0xFF000000 ) >> 24 <= i AND 
                ( ip_numeric_end & 0xFF000000 ) >> 24 >= i;

       SET i = i + 1;
    END WHILE;
END;

And then you will need to populate the table by calling that stored procedure:

CALL recalculate_ip_geolocation_lookup();

At this point you may delete the procedure you just created -- it is no longer needed, unless you want to recalculate the look-up table.

After the look-up table is in place, all you have to do is integrate it into your queries and make sure you're querying by the first octet. Your query to the look-up table will satisfy two conditions:

  1. Find all rows which match the first octet of your IP address
  2. Of that subset: Find the row which has the the range that matches your IP address

Because the step two is carried out on a subset of data, it is considerably faster than doing the range tests on the entire data. This is the key to this optimization strategy.

There are various ways for figuring out what the first octet of an IP address is; I used ( r.ip_numeric & 0xFF000000 ) >> 24 since my source IPs are in numeric form:

SELECT 
    r.*, 
    g.country_code
FROM 
    ip_geolocation g,
    ip_geolocation_lookup l,
    ip_random r
WHERE 
    l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24 AND 
    l.ip_numeric_start <= r.ip_numeric AND      
    l.ip_numeric_end >= r.ip_numeric AND 
    g.ip_numeric_start = l.ip_numeric_start;

Now, admittedly I did get a little lazy in the end: You could easily get rid of ip_geolocation table altogether if you made the ip_geolocation_lookup table also contain the country data. I'm guessing dropping one table from this query would make it a bit faster.

And, finally, here are the two other tables I used in this response for reference, since they differ from your tables. I'm certain they are compatible, though.

# This table contains the original geolocation data

CREATE TABLE `ip_geolocation` (
  `ip_start` varchar(16) NOT NULL DEFAULT '',
  `ip_end` varchar(16) NOT NULL DEFAULT '',
  `ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
  `ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
  `country_code` varchar(3) NOT NULL DEFAULT '',
  `country_name` varchar(64) NOT NULL DEFAULT '',
  PRIMARY KEY (`ip_numeric_start`),
  KEY `country_code` (`country_code`),
  KEY `ip_start` (`ip_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


# This table simply holds random IP data that can be used for testing

CREATE TABLE `ip_random` (
  `ip` varchar(16) NOT NULL DEFAULT '',
  `ip_numeric` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Photophilous answered 22/11, 2011 at 17:40 Comment(5)
Wow, extremly detailed answer. Please, give me a pair of days to test this approach. Looks, like working solution.Cassey
This query is much faster, than fullscan, but still it needs to scan a lot of rows (ranges_qty / 255). If we'll use geo ip per city range table (30 000 000 rows) this query will be slow. I found some soluton, that uses geometry. If one my question will get any apropriate answer (#8245035), I will have better solution for this question. If not - your answer is going to be the best one.Cassey
The question was actually different. The INNER JOIN works fine, while LEFT JOIN will take at least 4 minutes on 2k entity_ip table.Louettalough
You could just build a bigger look-up table. The answer above effectively ignores last 24 of the 32 bits of the IP address. Building the look-up table with, say, the first two octets will improve the query speed on large data sets and still keep the look-up table at a reasonable size.Photophilous
Thank you for this; the first octet solution was just the thing that was needed for me to query on millions of rows optimally.Tall
T
1

Just wanted to give back to the community:

Here's an even better and optimized way building on Aleksi's solution:

DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;

DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
  DECLARE i INT DEFAULT 0;
DROP TABLE `ip_geolocation_lookup`;

CREATE TABLE `ip_geolocation_lookup` (
  `first_octet` smallint(5) unsigned NOT NULL DEFAULT '0',
  `startIpNum` int(10) unsigned NOT NULL DEFAULT '0',
  `endIpNum` int(10) unsigned NOT NULL DEFAULT '0',
  `locId` int(11) NOT NULL,
  PRIMARY KEY (`first_octet`,`startIpNum`,`endIpNum`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT IGNORE INTO ip_geolocation_lookup
SELECT  startIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;

INSERT IGNORE INTO ip_geolocation_lookup
SELECT  endIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;

  WHILE i < 1048576 DO
    INSERT IGNORE INTO ip_geolocation_lookup
        SELECT i, startIpNum, endIpNum, locId 
        FROM ip_geolocation_lookup 
        WHERE first_octet = i-1
        AND endIpNum DIV 1048576 > i;
   SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;

CALL recalculate_ip_geolocation_lookup();

It builds way faster than his solution and drills down more easily because we're not just taking the first 8, but the first 20 bits. Join performance: 100000 rows in 158ms. You might have to rename the table and field names to your version.

Query by using

SELECT ip, kl.*
FROM random_ips ki
JOIN `ip_geolocation_lookup` kb ON (ki.`ip` DIV 1048576 = kb.`first_octet` AND ki.`ip` >= kb.`startIpNum` AND ki.`ip` <= kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;
Teena answered 18/1, 2013 at 13:36 Comment(0)
B
1

Can't comment yet, but user1281376's answers is wrong and doesn't work. the reason you only use the first octet is because you aren't going to match all ip ranges otherwise. there's plenty of ranges that span multiple second octets which user1281376s changed query isn't going to match. And yes, this actually happens if you use the Maxmind GeoIp data.

with aleksis suggestion you can do a simple comparison on the fîrst octet, thus reducing the matching set.

Backwoodsman answered 10/3, 2014 at 13:2 Comment(2)
Probably I should've checked that but at that point of time I've decided to just skip it as it was working anyway (I remember I've also assumed the author has done his homework). ThanksLouettalough
right, it's obviously faster, though especially with maxmind's geoip table you won't match level 3 for example. took me a while to figure out the first time I ran into this. So you'll have to add another row for end_range and then you're still stuck with a range query. And even worse when you don't have a match for the ip, it'll scan the whole table.Backwoodsman
B
0

I found a easy way. I noticed that all first ip in the group % 256 = 0, so we can add a ip_index table

CREATE TABLE `t_map_geo_range` (
  `_ip` int(10) unsigned NOT NULL,
  `_ipStart` int(10) unsigned NOT NULL,
  PRIMARY KEY (`_ip`)
) ENGINE=MyISAM

How to fill the index table

FOR_EACH(Every row of ip_geo)
{
    FOR(Every ip FROM ipGroupStart/256 to ipGroupEnd/256)
    {
        INSERT INTO ip_geo_index(ip, ipGroupStart);
    }
}

How to use:

SELECT * FROM YOUR_TABLE AS A
LEFT JOIN ip_geo_index AS B ON B._ip = A._ip DIV 256
LEFT JOIN ip_geo AS C ON C.ipStart = B.ipStart;

More than 1000 times Faster.

Bunyip answered 8/5, 2017 at 19:22 Comment(1)
Please see answer above.Louettalough

© 2022 - 2024 — McMap. All rights reserved.