Python - parse IPv4 addresses from string (even when censored)
S

4

6

Objective: Write Python 2.7 code to extract IPv4 addresses from string.

String content example:


The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).


As you can see from the above, I am struggling to find a way to parse through a txt file that may contain IPs depicted in multiple forms of "censorship" (to prevent hyper-linking).

I'm thinking that a regular expression is the way to go. Maybe something along the lines of: any grouping of four ints 0-255 or 000-255 separated by anything in a 'separators list', which would consist of periods, brackets, parentheses, or any of the other aforementioned examples. This way, the 'separators list' could be updated as needed.

Not sure if this is the proper way to go or even possible, so any help with this is greatly appreciated.


Update: Thanks to recursive's answer below, I now have the following code working for the above example. It will...

  • find the IPs
  • place them into a list
  • clean them of the spaces/braces/etc
  • and replace the uncleaned list entry with the cleaned one.

Caveat: The code below does not account for incorrect/non-valid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it will drop the trailing 6 and 3 from the aforementioned. If the first octet is invalid (ex: 256.10.10.10), it will drop the leading 2 (resulting in 56.10.10.10).

import re

def extractIPs(fileContent):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    ips = [each[0] for each in re.findall(pattern, fileContent)]   
    for item in ips:
        location = ips.index(item)
        ip = re.sub("[ ()\[\]]", "", item)
        ip = re.sub("dot", ".", ip)
        ips.remove(item)
        ips.insert(location, ip) 
    return ips

myFile = open('***INSERT FILE PATH HERE***')
fileContent = myFile.read()

IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)
Suprematism answered 26/6, 2013 at 18:37 Comment(5)
It would be nice to post what you have tried so far and where you got stuck. That way we could improve your current solution and you may learn (more) from it. - Paulinapauline
At first I was splitting on spaces and had everything almost working perfectly, but after I realized that sometimes spaces prefix the periods, I went back to the drawing board. So far, I have tried many different examples from StackOverflow but have only found ways to grab 'uncensored' IPs. For example, I tried splitting on the periods and then validating each element (re.match(r'^([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])$', part)). I have never played with regex before and am kinda new at Python, so I am somewhat daunted by how to approach this. - Suprematism
recursive has provided an answer. I'm a bit allergic to help vampires, thus my reaction. If you had replied earlier I could have answered you :) - Paulinapauline
No worries HamZa. I appreciate it and understand about "help vampires". I may come across that way since I am not formally trained in programming (read 'total noob') and so sometimes have dumb questions or need a pointer in the correct direction to even formulate it in my mind. Recursive was extremely helpful and I have almost finished my code now. - Suprematism
I see that you're putting in effort. That makes me happy, keep it going! Ah, and if recursive's answer was "the answer" then don't forget to accept his answer. When you get about 20 rep, you may come by a chatroom. There you can ask more freely and broadly. - Paulinapauline
S
1

The code below will...

  • find IPs in strings even when censored (ex: 192.168.1[dot]20 or 10.10.10 .21)
  • place them into a list
  • clean them of the censorship (spaces/braces/parenthesis)
  • and replace the uncleaned list entry with the cleaned one.

Caveat: The code below does not account for incorrect/non-valid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it will drop the trailing digit (6 and 3 from the aforementioned). If the first octet is invalid (ex: 256.10.10.10), it will drop the leading digit (resulting in 56.10.10.10).


import re

def extractIPs(fileContent):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    ips = [each[0] for each in re.findall(pattern, fileContent)]   
    for item in ips:
        location = ips.index(item)
        ip = re.sub("[ ()\[\]]", "", item)
        ip = re.sub("dot", ".", ip)
        ips.remove(item)
        ips.insert(location, ip) 
    return ips


myFile = open('***INSERT FILE PATH HERE***')
fileContent = myFile.read()

IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)
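
One possible way to address the caveat above, offered as a sketch and not as part of the original answer: wrap the same pattern in digit lookarounds so that a match can neither start nor end in the middle of a longer run of digits. The names `STRICT` and `extract_strict` are made up for this illustration, and `print()` is called with a single argument so the snippet runs on Python 2.7 as well as Python 3.

```python
import re

# Same octet and separator building blocks as the pattern above,
# written with non-capturing groups.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"

# (?<![0-9]) and (?![0-9]) are assumptions added for this sketch: they
# reject a match that begins or ends inside a longer digit run.
STRICT = re.compile(r"(?<![0-9])%s(?:%s%s){3}(?![0-9])" % (OCTET, SEP, OCTET))


def extract_strict(text):
    """Return cleaned IPs, skipping candidates that would be truncated."""
    cleaned = []
    for match in STRICT.finditer(text):
        ip = re.sub(r"[ ()\[\]]", "", match.group(0))  # drop spaces/brackets
        cleaned.append(ip.replace("dot", "."))
    return cleaned


sample = "256.10.10.10, 192.168.0.256, 10.10.10 .21, 192.168.1[dot]20"
print(extract_strict(sample))
# ['10.10.10.21', '192.168.1.20']
```

Under these assumptions 256.10.10.10 and 192.168.0.256 are skipped entirely instead of being truncated. The five-part case 192.168.1.2.3 is not handled by this sketch; it still yields 192.168.1.2.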

Suprematism answered 30/6, 2013 at 8:19 Comment(0)
C
9

Here is a regex that works:

import re
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
text = "The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. "
ips = [match[0] for match in re.findall(pattern, text)]
print ips

# output: ['192.168.1.1', '8.8.8.8', '101.099.098.000', '192.168.1[.]1', '192.168.1(.)1', '192.168.1[dot]1', '192.168.1(dot)1', '192 .168 .1 .1', '192. 168. 1. 1']

The regex has a few main parts, which I will explain here:

  • ([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])
    This matches the numeric parts of the IP address. | means "or". The first case handles numbers from 0 to 199, with or without leading zeroes. The other two cases handle numbers from 200 to 255.
  • [ (\[]?(\.|dot)[ )\]]?
    This matches the "dot" parts. There are three sub-components:
    • [ (\[]? The "prefix" for the dot. Either a space, an opening parenthesis, or an opening square bracket. The trailing ? means that this part is optional.
    • (\.|dot) Either "dot" or a period.
    • [ )\]]? The "suffix". Same logic as the prefix.
  • {3} means repeat the previous component 3 times.
  • The final element is another number, which is the same as the first, except it is not followed by a dot.
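
A comment further down reports that this pattern cuts the last octet short above 199 (for example, 192.168.1.200 comes back as 192.168.1.20). A plausible explanation is that regex alternation takes the first branch that lets the overall match succeed: with the one-to-three-digit branch listed first, the final octet stops at two digits because nothing after it forces backtracking. The sketch below compares the two orderings. `SHORT_FIRST`, `LONG_FIRST`, and `build` are names invented for the illustration, and non-capturing groups are used only to keep the output simple.

```python
import re

SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"

# Branch order as in the answer above: the generic 0-199 branch comes first.
SHORT_FIRST = r"(?:[01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])"
# Most specific branch first, as in the question's updated code.
LONG_FIRST = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"


def build(octet):
    """Three 'octet + separator' groups followed by a final octet."""
    return re.compile("(?:%s%s){3}%s" % (octet, SEP, octet))


text = "192.168.1.200 and 255[.]255[.]255[.]255"

print([m.group(0) for m in build(SHORT_FIRST).finditer(text)])
# ['192.168.1.20', '255[.]255[.]255[.]25']

print([m.group(0) for m in build(LONG_FIRST).finditer(text)])
# ['192.168.1.200', '255[.]255[.]255[.]255']
```

Only the last octet is affected, because the earlier octets must be followed by a separator, which forces the engine to backtrack into the longer branches.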
Cerda answered 26/6, 2013 at 19:1 Comment(13)
Why don't you stay solidary with me and wait until he provides what he has tried? - Paulinapauline
I think the bottom two paragraphs of the question sufficiently explain the current state of nephos' progress. It may be that there is no code, but it's clear that some thought has been invested, so that's ok. Since there is enough information to provide a hopefully helpful answer, I see no reason not to. - Cerda
I shall say: congratulations! - Paulinapauline
@Cerda Would you mind explaining how that regular expression works? - Bullivant
@recursive: I've been thinking about buying a car, will you buy it for me? I doubt it. Thinking of a solution and trying it are two different things. - Sunderance
@MadaraUchiha: I will tell you how to buy a car. I will not actually pay for it. Similarly, I will tell you how to apply a regex, but I won't support your software. - Cerda
@BenjaminGruenbaum: I added an explanation. - Cerda
@Cerda Thanks :) That makes the answer much better. - Bullivant
Thank you everyone for the quick reply. The regex explanation offers me valuable insight. I would really like to learn to do this on my own as opposed to always trying to find pre-written code or ask for help. The above code does not remove the 'censorship elements' and return valid IPs. The result should eliminate the brackets and such and replace them with dots. I will try to take what I've learned from the above and tweak until I get the desired results (unless anyone else wants to chime in). Thanks again for the quick assist. - Suprematism
@nephos: It should be relatively straightforward to apply the same techniques explained in the answer to extract the "uncensored" version. If you get stuck, let us know. - Cerda
@Cerda Thanks again for your assistance. I am now starting to understand regex and have workable code that I updated the question with. It may be a bit sloppy but it works and that's what I can do for now. If you wish, please feel free to let me know how it can be improved. - Suprematism
@recursive: Sorry. Have to bother you again. The updated code above stops matching the last octet at 199 (dropping the last digit for anything higher). 192.168.1.199 = 192.168.1.199 / 192.168.1.200 = 192.168.1.20 / 255.255.255.255 = 255.255.255.25. I tried using www.rubular.com/r/XzRJrrnaLt as well as regexr.com?35dbf and always got the same result. Tried testing other ways, such as changing [01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5] to [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]. Still a no go. Any further suggestions are much appreciated. - Suprematism
@recursive: Please disregard my last message. After being awake for way too long, I finally found an acceptable answer. Thanks again for all your assistance. - Suprematism
I
3

Description

This regex will match each of the four octets of what looks like an IP address. Each octet is placed into its own capture group for collection.

(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])


Given the following sample text this regex will match all 10 embedded IP strings in their entirety including the first one. Working example: http://www.rubular.com/r/1MbGZOhuj5

The following are IP addresses: 192.168.1.222, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).

The resulting matches could be iterated over and a properly formatted IP string could be constructed by joining the 4 capture groups with a dot.
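
The joining step could be sketched in Python as follows. This is an illustration rather than code from the answer: `OCTET`, `PATTERN`, and `join_octets` are made-up names, and the pattern is assembled from the same octet group and `\D{1,5}` separator shown above.

```python
import re

# One capture group per octet, as in the regex above.
OCTET = r"(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])"
PATTERN = re.compile(OCTET + (r"\D{1,5}" + OCTET) * 3)


def join_octets(text):
    """Rebuild a dotted IP string from the four capture groups of each match."""
    return [".".join(match.groups()) for match in PATTERN.finditer(text)]


sample = "192.168.1[.]1 or 192 .168 .1 .1 or 10(dot)0(dot)0(dot)7"
print(join_octets(sample))
# ['192.168.1.1', '192.168.1.1', '10.0.0.7']
```

Because `\D{1,5}` accepts any short run of non-digits, unrelated numbers can be pulled into a match: as the comment below points out, "(132) - 10.10.10.10 (2.31Mb)" comes out as 132.10.10.10.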

Ingvar answered 27/6, 2013 at 5:6 Comment(1)
Awesome resource Denomales! I will definitely be bookmarking rubular.com and using it to learn more about how regex works. Thanks! I like your approach and it most definitely works for the example I provided. I have another example that is a bit more "messy" that it does not work for, but I will get on this again over the weekend and find a full solution and post it here when done. Thanks again for the great link. In case you are wondering what I mean by "messy": "(132) - 10.10.10.10 (2.31Mb)" gets parsed to 132.10.10.10 - Suprematism
E
0

Extract and Categorize IPv4 Addresses (Even When Censored)

Note: This is just an implementation of a class I wrote for extracting IPv4 Addresses. I will likely update my class with a method for this functionality in the future. You can find it on my GitHub page.


What I'm demonstrating below is the following:

  1. Cleaning up your string content example

  2. Bringing your string data into a list

  3. Using the ExtractIPs() class to parse and categorize IPv4 Addresses

    • This class returns a dictionary containing 4 lists:

      • Valid IPv4 Addresses

      • Public IPv4 Addresses

      • Private IPv4 Addresses

      • Invalid IPv4 Addresses


  • ExtractIPs class

    #!/usr/bin/env python
    
    """Extract and Classify IP Addresses."""
    
    import re  # Use Regular Expressions.
    
    
    __program__ = "IPAddresses.py"
    __author__ = "Johnny C. Wachter"
    __copyright__ = "Copyright (C) 2014 Johnny C. Wachter"
    __license__ = "MIT"
    __version__ = "0.0.1"
    __maintainer__ = "Johnny C. Wachter"
    __contact__ = "[email protected]"
    __status__ = "Development"
    
    
    class ExtractIPs(object):
    
        """Extract and Classify IP Addresses From Input Data."""
    
        def __init__(self, input_data):
            """Instantiate the Class."""
    
            self.input_data = input_data
    
            self.ipv4_results = {
                'valid_ips': [],  # Store all valid IP Addresses.
                'invalid_ips': [],  # Store all invalid IP Addresses.
                'private_ips': [],  # Store all Private IP Addresses.
                'public_ips': []  # Store all Public IP Addresses.
            }
    
        def extract_ipv4_like(self):
            """Extract IP-like strings from input data.
            :rtype : list
            """
    
            ipv4_like_list = []
    
            ip_like_pattern = re.compile(r'([0-9]{1,3}\.){3}([0-9]{1,3})')
    
            for entry in self.input_data:
    
                if re.match(ip_like_pattern, entry):
    
                    if len(entry.split('.')) == 4:
    
                        ipv4_like_list.append(entry)
    
            return ipv4_like_list
    
        def validate_ipv4_like(self):
            """Validate that IP-like entries fall within the appropriate range."""
    
            if self.extract_ipv4_like():
    
                # We're gonna want to ignore the below two addresses.
                ignore_list = ['0.0.0.0', '255.255.255.255']
    
                # Separate the Valid from Invalid IP Addresses.
                for ipv4_like in self.extract_ipv4_like():
    
                    # Split the 'IP' into parts so each part can be validated.
                    parts = ipv4_like.split('.')
    
                    # All part values should be between 0 and 255.
                    if all(0 <= int(part) < 256 for part in parts):
    
                        if not ipv4_like in ignore_list:
    
                            self.ipv4_results['valid_ips'].append(ipv4_like)
    
                    else:
    
                        self.ipv4_results['invalid_ips'].append(ipv4_like)
    
            else:
                pass
    
        def classify_ipv4_addresses(self):
            """Classify Valid IP Addresses."""
    
            if self.ipv4_results['valid_ips']:
    
                # Now we will classify the Valid IP Addresses.
                for valid_ip in self.ipv4_results['valid_ips']:
    
                    private_ip_pattern = re.findall(
    
                        r"""
                        (^127\.0\.0\.1)|  # Loopback
    
                        (^10\.(\d{1,3}\.){2}\d{1,3})|  # 10/8 Range
    
                        # Matching the 172.16/12 Range takes several matches
                        (^172\.1[6-9]\.\d{1,3}\.\d{1,3})|
                        (^172\.2[0-9]\.\d{1,3}\.\d{1,3})|
                        (^172\.3[0-1]\.\d{1,3}\.\d{1,3})|
    
                        (^192\.168\.\d{1,3}\.\d{1,3})|  # 192.168/16 Range
    
                        # Match APIPA Range.
                        (^169\.254\.\d{1,3}\.\d{1,3})
    
                        # VERBOSE for a clean look of this RegEx.
                        """, valid_ip, re.VERBOSE
                    )
    
                    if private_ip_pattern:
    
                        self.ipv4_results['private_ips'].append(valid_ip)
    
                    else:
                        self.ipv4_results['public_ips'].append(valid_ip)
    
            else:
                pass
    
        def get_ipv4_results(self):
            """Extract and classify all valid and invalid IP-like strings.
            :returns : dict
            """
    
            self.extract_ipv4_like()
            self.validate_ipv4_like()
            self.classify_ipv4_addresses()
    
            return self.ipv4_results
    
  • Example Extraction With Censorship

    censored = re.compile(
        r"""
    
        \(\.\)|
        \(dot\)|
        \[\.\]|
        \[dot\]|
        ([ ]\.)  # space before the dot; re.VERBOSE ignores a bare space
    
        """, re.VERBOSE | re.IGNORECASE
    )
    
    data_list = input_string.split()  # Bring your input string to a list.
    
    clean_list = []  # List to store the cleaned up input.
    
    for entry in data_list:
    
        # Remove undesired leading and trailing characters.
        clean_entry = entry.strip(' .,<>?/[]\\{}"\'|`~!@#$%^&*()_+-=')
    
        clean_list.append(clean_entry)  # Add the entry to the clean list.
    
    clean_unique_list = list(set(clean_list))  # Remove duplicates in list.
    
    # Now we can go ahead and extract IPv4 Addresses. Note that this will be a dict.
    results = ExtractIPs(clean_list).get_ipv4_results()
    
    for k, v in results.iteritems():
    
        # After all that work, make sure the results are nicely presented!
        print("\n%s: %s" % (k, v))
    
    • Results:

      public_ips: ['8.8.8.8', '101.099.098.000']
      
      valid_ips: ['192.168.1.1', '8.8.8.8', '101.099.098.000']
      
      invalid_ips: []
      
      private_ips: ['192.168.1.1']
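
The `censored` pattern in the snippet above is compiled but never applied, which may be why the results list only the three uncensored addresses. Below is a minimal sketch of one way it could be applied before splitting, assuming the intent was to turn each censored separator back into a plain dot. `normalize` is a hypothetical helper and not part of the class, and the pattern is written without `re.VERBOSE` so that the literal space before the dot is kept.

```python
import re

# Same alternatives as the `censored` pattern above, on one line so the
# literal space in the last branch is significant.
censored = re.compile(r"\(\.\)|\(dot\)|\[\.\]|\[dot\]| \.", re.IGNORECASE)


def normalize(input_string):
    """Replace each censored separator with a plain dot (hypothetical helper)."""
    return censored.sub(".", input_string)


text = "192.168.1[.]1 or 192.168.1(DOT)1 or 192 .168 .1 .1"
print(normalize(text))
# 192.168.1.1 or 192.168.1.1 or 192.168.1.1
```

The normalized string could then be split and passed to `ExtractIPs` as before. The form with a space after the dot (192. 168. 1. 1) is not covered by these alternatives.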
      
Exactly answered 9/6, 2014 at 4:36 Comment(0)
