Best choice for in memory data structure for IP address filter in Java

Asked 29/11, 2011 at 19:4 Answered 8/3, 2013 at 18:44

I have a file of CIDR entries such as 192.168.1.0/24, and each entry is converted into this two-column structure:

3232236030 3232235777

Each string IP address is converted with this code:

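import java.net.Inet4Address;
import java.net.InetAddress;
import org.apache.commons.net.util.SubnetUtils;

String subnet = "192.168.1.0/24";
SubnetUtils utils = new SubnetUtils(subnet);

// getByName throws UnknownHostException; for a literal IP no DNS lookup is made
Inet4Address a = (Inet4Address) InetAddress.getByName(utils.getInfo().getHighAddress());
long high = bytesToLong(a.getAddress());
Inet4Address b = (Inet4Address) InetAddress.getByName(utils.getInfo().getLowAddress());
long low = bytesToLong(b.getAddress());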

private static long bytesToLong(byte[] address) {
   long ipnum = 0;
   for (int i = 0; i < 4; ++i) {
       long y = address[i];
       if (y < 0) {
           y += 256;
       }
       ipnum += y << ((3 - i) * 8);
   }
   return ipnum;
}

Consider that there are over 5 million entries of the form (high low: 3232236030 3232235777).
The ranges can also overlap, so an IP can fall into multiple ranges. Finding just the first matching range is more than enough.
The data is read only.
What would be the fastest way to find the range the ipToBefiltered belongs to? The structure will be entirely in memory so no database lookups.

UPDATE:

I found this Peerblock project (it has over a million downloads, so I'm thinking it must have some fast algorithms): http://code.google.com/p/peerblock/source/browse/trunk/src/pbfilter/filter_wfp.c

Does anyone know what technique the project uses for creating the list of ranges and then searching them?

Gibbie answered 29/11, 2011 at 19:4 Comment(3)

"The structure will be entirely in memory so no database lookups." - Why not have an in-memory database? – Kotta 29/11, 2011 at 19:15

find the range the ipToBefiltered belongs to? You want to know the range(s) the given IP is in, not just T/F whether or not it's in some defined range? – Belton 29/11, 2011 at 19:36

@Mat Are there overlaps in the ranges? – Gally 29/11, 2011 at 21:36

When it comes down to it I just need to know if the IP is present in any of the 5M ranges.

I would consider an n-ary tree, where n=256, and work from the dotted address rather than the converted integer.

The top level would be an array of 256 objects. A null entry means "No" there is no range that contains the address, so given your example 192.168.1.0/24 array[192] would contain an object, but array[100] might be null because no range was defined for any 100.x.x.x/n

The stored object contains either a (reference to) another array[256] or a range specifier; only one of the two would be set. So 192.0.0.0/8 would end up with a range specifier indicating that all addresses within that range are to be filtered. This would allow for things like 192.255.0.0/10, where the first 10 bits of the address are significant (1100 0000 11xx xxxx); otherwise you need to check the next octet in the 2nd level array.

Initially coalescing overlapping ranges, if any, into larger ranges... e.g. 3 .. 10 and 7 .. 16 becomes 3 .. 16 ... allows this, since you don't need to associate a given IP with which range defined it.

This should require no more than 8 comparisons. Each octet is initially used directly as an index, followed by a compare for null, a compare for terminal-node (is it a range or a pointer to the next tree level)

Worst case memory consumption is theoretically about 4.3 billion leaf entries (256 ^ 4) if every IP address was in a filtering range, but of course that would coalesce into a single range, so it would actually be only 1 range object. A more realistic worst case would probably be more like 256 ^ 3, or about 16.7 million entries. Real world usage would probably have the majority of array[256] nodes at each level empty.

This is essentially similar to Huffman / prefix coding. The shortest distinct prefix can terminate as soon as an answer (a range) is found, so often you would have averages of < 4 compares.
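A minimal sketch of the octet-indexed tree described above, assuming the ranges arrive as CIDR blocks (base address plus prefix length). The class name OctetTrie and the methods add, contains and toInt are illustrative names, not part of any library. A prefix that does not end on an octet boundary (such as /10) is handled by marking every child slot it covers at the next level.

```java
/** Illustrative 256-way trie keyed on IPv4 octets; all names here are hypothetical. */
final class OctetTrie {
    private static final class Node {
        Node[] children;   // allocated lazily; null once the node is terminal
        boolean terminal;  // true: every address under this prefix is filtered
    }

    private final Node root = new Node();

    /** Adds a CIDR block: 32-bit base address and prefix length in 0..32. */
    void add(int base, int prefixLen) {
        Node node = root;
        int fullOctets = prefixLen / 8;
        for (int depth = 0; depth < fullOctets; depth++) {
            if (node.terminal) return;              // already covered by a shorter prefix
            node = child(node, octet(base, depth));
        }
        int rest = prefixLen % 8;
        if (rest == 0) {                            // prefix ends on an octet boundary
            markTerminal(node);
            return;
        }
        if (node.terminal) return;
        int span = 1 << (8 - rest);                 // number of slots the partial octet covers
        int first = octet(base, fullOctets) & ~(span - 1);
        for (int i = first; i < first + span; i++) {
            markTerminal(child(node, i));
        }
    }

    /** Returns true if the address falls inside any added block. */
    boolean contains(int address) {
        Node node = root;
        for (int depth = 0; depth < 4; depth++) {
            if (node.terminal) return true;         // shortest matching prefix found
            if (node.children == null) return false;
            node = node.children[octet(address, depth)];
            if (node == null) return false;
        }
        return node.terminal;
    }

    private static Node child(Node parent, int index) {
        if (parent.children == null) parent.children = new Node[256];
        if (parent.children[index] == null) parent.children[index] = new Node();
        return parent.children[index];
    }

    private static void markTerminal(Node node) {
        node.terminal = true;
        node.children = null;                       // drop any narrower blocks below
    }

    private static int octet(int address, int depth) {
        return (address >>> (24 - 8 * depth)) & 0xFF;
    }

    /** Converts a dotted-quad string to its 32-bit value. */
    static int toInt(String dotted) {
        int value = 0;
        for (String part : dotted.split("\\.")) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }
}
```

A lookup touches at most four array slots, and stops early whenever a shorter prefix already covers the address.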

Belton answered 29/11, 2011 at 22:2 Comment(0)

I would use a sorted array of int (the base address) and another array the same size (the end address). This would use 5M * 8 = 40 MB. The first IP is the base and the second IP is the last address in range. You would need to remove intersections.

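To find out whether an address is filtered, do a binary search, which is O(log N), on the base addresses. If there is no exact match, take the closest base address below it and check that the address is less than (or equal to) the corresponding upper bound.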
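A minimal sketch of this two-array lookup, assuming the ranges are already sorted by start and free of overlaps. The class name RangeTable and its methods are illustrative. Because Java's int is signed, the sketch stores each address with its sign bit flipped so that signed ordering matches the unsigned ordering of IP addresses; that trick is a choice made here, not something stated in the answer.

```java
import java.util.Arrays;

/** Illustrative sorted-array range table; names are hypothetical. */
final class RangeTable {
    private final int[] starts; // range starts, sign bit flipped, ascending
    private final int[] ends;   // matching inclusive range ends, sign bit flipped

    /** Expects inclusive ranges sorted by start, with no overlaps. */
    RangeTable(long[] sortedStarts, long[] sortedEnds) {
        starts = new int[sortedStarts.length];
        ends = new int[sortedEnds.length];
        for (int i = 0; i < sortedStarts.length; i++) {
            starts[i] = flip(sortedStarts[i]);
            ends[i] = flip(sortedEnds[i]);
        }
    }

    /** Returns true if the address (0..2^32-1) lies inside any range. */
    boolean contains(long address) {
        int key = flip(address);
        int idx = Arrays.binarySearch(starts, key);
        if (idx >= 0) {
            return true;                     // exact match on a range start
        }
        int insertion = -idx - 1;            // index of the first start greater than key
        if (insertion == 0) {
            return false;                    // address is below every range
        }
        return key <= ends[insertion - 1];   // compare with the preceding range's end
    }

    /** Maps an unsigned 32-bit value to an int whose signed order is the same. */
    private static int flip(long address) {
        return ((int) address) ^ Integer.MIN_VALUE;
    }
}
```

With 5 million ranges this keeps the footprint at two int arrays (about 40 MB, as estimated above) and needs roughly 23 comparisons per lookup.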

Skiest answered 29/11, 2011 at 19:42 Comment(0)

I found this binary chop algorithm in Vuze (aka azureus) project:

public IpRange isInRange(long address_long) {
    checkRebuild();

    if (mergedRanges.length == 0) {
        return (null);
    }

    // assisted binary chop

    int bottom = 0;
    int top = mergedRanges.length - 1;
    int current = -1;

    while (top >= 0 && bottom < mergedRanges.length && bottom <= top) {

        current = (bottom + top) / 2;

        IpRange e = mergedRanges[current];

        long this_start = e.getStartIpLong();
        long this_end = e.getMergedEndLong();

        if (address_long == this_start) {
            break;
        } else if (address_long > this_start) {

            if (address_long <= this_end) {
                break;
            }

            // lies to the right of this entry

            bottom = current + 1;

        } else if (address_long == this_end) {
            break;
        } else {
            // < this_end

            if (address_long >= this_start) {
                break;
            }
            top = current - 1;
        }
    }

    if (top >= 0 && bottom < mergedRanges.length && bottom <= top) {

        IpRange e = mergedRanges[current];

        if (address_long <= e.getEndIpLong()) {
            return (e);
        }

        IpRange[] merged = e.getMergedEntries();

        if (merged == null) {
            //inconsistent merged details - no entries
            return (null);
        }

        for (IpRange me : merged) {
            if (me.getStartIpLong() <= address_long && me.getEndIpLong() >= address_long) {
                return (me);
            }
        }
    }
    return (null);
}

Seems to be performing pretty well. If you know about something faster please let me know.

Gibbie answered 30/11, 2011 at 19:25 Comment(0)

If you just have a CIDR address (or a list of them) and you want to check if some ipAddress is in the range of that CIDR (or list of CIDR's), just define a Set of SubnetUtils objects.

Unless you are filtering a very large number of addresses, this is all String comparison and will execute extremely fast. You don't need to build a binary tree based on the higher/lower order bits and all of that complicated jazz.

String subnet = "192.168.1.0/24";
SubnetUtils utils = new SubnetUtils(subnet);
//...
//for each subnet, create a SubnetUtils object
Set<SubnetUtils> subnets = getAllSubnets();
//...

Use a Guava Predicate to filter the ipAddresses that are not in the range of your set of subnets:

   Set<String> ipAddresses = getIpAddressesToFilter();
   Set<String> ipAddressesInRange =
       Sets.filter(ipAddresses, filterIpsBySubnet(subnets));


   Predicate<String> filterIpsBySubnet(final Set<SubnetUtils> subnets){
       return new Predicate<String>() {
            @Override
            public boolean apply(String ipAddress) {
                for (SubnetUtils subnet : subnets) {
                    if (subnet.getInfo().isInRange(ipAddress)) {
                        return true;
                    }
                }
                return false;
            }
        };
   }

Now if the IP is in any of the subnets, you have a nice simple filter and you don't have to build a data structure that you will have to unit test. If this is not performant enough, then go to optimization. Don't prematurely optimize :)

Beeman answered 8/3, 2013 at 18:44 Comment(0)

Here is the beginning of an answer; I'll come back when I get more free time.

Setup:

Sort the ranges by the starting number.
Since these are IP addresses, I assume that none of the ranges overlap. If there are overlaps, you should probably run through the list, merging ranges and trimming unnecessary ones (e.g. if you have a range 1 - 10, you can trim the range 5 - 7).
1. To merge or trim do this (assume range a immediately precedes range b in the sorted list):
  1. If b.end <= a.end then range b is a subset of range a and you can remove range b.
  2. If b.start <= a.end and b.end > a.end then ranges a and b overlap and you can merge them. Set a.end = b.end then remove range b.
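The sort-then-merge pass can be sketched as follows, assuming each range is an inclusive {start, end} pair held in a long[]. The class name RangeMerger and the method merge are illustrative. Both cases (subset and overlap) collapse into one step: extend the previous range's end to the larger of the two ends.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative merge of overlapping inclusive ranges; names are hypothetical. */
final class RangeMerger {
    /** Returns a new list of non-overlapping ranges sorted by start. */
    static List<long[]> merge(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] b : sorted) {
            long[] a = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (a != null && b[0] <= a[1]) {
                // b starts inside a: either a subset (end unchanged) or an overlap (end extended)
                a[1] = Math.max(a[1], b[1]);
            } else {
                // b starts after a ends: begin a new range (copied so the input is not modified)
                merged.add(new long[] {b[0], b[1]});
            }
        }
        return merged;
    }
}
```

The result is sorted and overlap-free, which is what a binary search over range starts requires.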

Diverse answered 29/11, 2011 at 20:3 Comment(0)
