Regex to match Domain.CCTLD

Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com should not match, but google.com should. This gets complicated with multi-part suffixes under ccTLDs, such as .co.uk. Does anyone know a solution? Thanks in advance.

EDIT: I've realized I also have to deal with multiple subdomains, like john.doe.google.co.uk. Need a solution now more than ever :P.

Terena answered 7/7, 2010 at 22:16 Comment(4)
Do you explicitly need a regex, or would a function to do it suffice? - Overwind
This would become a quite large regex, seeing as you would need to treat all ccSLDs as special cases, and there are a lot (and I mean A LOT) of ccSLDs. Brazil has 66 of them! - Thoria
@Benson, a function would work, as long as it could find domain.cctld in a long list of domains. - Terena
Possible duplicate of "Get the subdomain from a URL". - Aldercy
O
3

Based on your comment above, I'm going to reinterpret the question -- rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names to only include first class domains, e.g. google.com, amazon.co.uk.

First, we'll need a list of TLDs (more precisely, of public suffixes such as com and co.uk). As Greg mentioned, the Public Suffix List is a great place to start. Let's assume you've parsed the list into a Python list called suffixes, with each entry stored without a leading dot. If this isn't something you're comfortable with, comment and I can add some code that will do it.

suffixes = parse_suffix_list("suffix_list.txt")
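
For reference, here is a minimal sketch of what parse_suffix_list might look like. The function is hypothetical (it is not part of any library), and it assumes the file follows the Public Suffix List format: one rule per line, with comments starting with "//". It deliberately skips wildcard rules such as "*.ck" and exception rules such as "!www.ck", which a complete implementation would have to honor.

```python
def parse_suffix_list(path):
    """Return the plain suffix rules from a Public Suffix List file as a set."""
    suffixes = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            # Skip blank lines and comments, which start with "//"
            if not parts or parts[0].startswith("//"):
                continue
            # A rule is the text up to the first whitespace
            rule = parts[0]
            # Simplification: ignore wildcard ("*.ck") and exception ("!www.ck") rules
            if rule.startswith("*") or rule.startswith("!"):
                continue
            suffixes.add(rule.lower())
    return suffixes
```

Returning a set rather than a list keeps membership tests fast, and it can still be iterated over like a list.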

Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:

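def is_domain(d):
    # A bare public suffix such as "co.uk" is not a registrable domain
    if d in suffixes:
        return False
    for suffix in suffixes:
        # Require a dot before the suffix so "notuk" does not match "uk"
        if d.endswith('.' + suffix):
            # Get the base domain name without the suffix and its leading dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain.
            if base_name and '.' not in base_name:
                return True
    # If we get here, no matches were found
    return False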
Overwind answered 8/7, 2010 at 21:41 Comment(2)
Thanks! I can find my way from here. - Terena
You can now use a simple but excellent python package to do the heavy lifting for this: pypi.python.org/pypi/publicsuffix - Veradis
A
8

It sounds like you are looking for the information available through the Public Suffix List project.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the public suffix list, or find an existing library that already does so.
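
As a rough illustration of what such code might look like, here is a minimal Python sketch that reduces a hostname to its public suffix plus one label. The names SUFFIXES and registrable_domain are made up for this example, the suffix set is a tiny hand-picked subset rather than the real list, and the list's wildcard and exception rules are ignored.

```python
# Tiny illustrative subset; a real program would load the full Public Suffix List.
SUFFIXES = {"com", "org", "uk", "co.uk"}


def registrable_domain(hostname, suffixes=SUFFIXES):
    """Return the public suffix plus one label, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" wins over "uk"
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # i == 0 means the hostname is itself a public suffix
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


print(registrable_domain("john.doe.google.co.uk"))  # google.co.uk
print(registrable_domain("docs.google.com"))        # google.com
print(registrable_domain("co.uk"))                  # None
```

A hostname is then an "atomic domain" in the question's sense exactly when registrable_domain(hostname) returns the hostname itself.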

Aldercy answered 7/7, 2010 at 22:23 Comment(4)
Interesting and probably very useful list. - Thoria
Thanks, Greg. That's absolutely the right answer. There are libraries to do Public Suffix List processing in several languages at dkim-reputation.org/regdom-libs - Undertone
@Anirvan, do you know an equivalent for Python? The library you posted is only available in C, PHP, and Perl. - Terena
@Tom: Over a year later, here is a python package for the job: pypi.python.org/pypi/publicsuffix - Veradis
O
2

I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):

tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i

I don't think it's possible to properly differentiate between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e., you could always construct a subdomain that looks like a TLD if you knew how the regex worked).
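
Since the question is about Python, here is a rough Python sketch of the same idea. The tlds list is a small illustrative subset, not the real list, and is_atomic_domain is a made-up name for this example.

```python
import re

# Illustrative subset only; the real list of suffixes is far longer.
tlds = ["com", "co.uk", "eu", "org"]

# Escape each suffix so that "." is matched literally, then join with "|"
tld_alternation = "|".join(re.escape(tld) for tld in tlds)

# One label (letters, digits, inner hyphens), a dot, then one known suffix
pattern = re.compile(
    r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:" + tld_alternation + r")",
    re.IGNORECASE,
)


def is_atomic_domain(name):
    """Return True if name is exactly one label followed by a known suffix."""
    return pattern.fullmatch(name) is not None


print(is_atomic_domain("google.co.uk"))     # True
print(is_atomic_domain("docs.google.com"))  # False
```

Because the label part cannot contain a dot, anything with an extra subdomain fails to match. With the full suffix list the alternation gets very long, which is why a set lookup is usually the more practical choice.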

Opec answered 7/7, 2010 at 22:32 Comment(0)
