Based on your comment above, I'm going to reinterpret the question: rather than writing a regex that matches them, we'll write a function that matches them, and apply that function to filter a list of domain names down to first-class domains only, e.g. google.com, amazon.co.uk.
First, we'll need a list of TLDs. As Greg mentioned, the public suffix list is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.
suffixes = parse_suffix_list("suffix_list.txt")
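In case it helps, here is a minimal sketch of what parse_suffix_list might look like. It assumes the file follows the public suffix list layout (one rule per line, comments starting with //), and it simply skips wildcard (*.) and exception (!) rules, which a complete implementation would need to handle. The helper parse_suffix_lines is a hypothetical name introduced here for illustration.

```python
def parse_suffix_lines(lines):
    """Collect plain suffix rules from an iterable of text lines."""
    suffixes = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments (comments start with "//")
        if not line or line.startswith("//"):
            continue
        # A rule ends at the first whitespace character
        rule = line.split()[0]
        # Simplification: ignore wildcard ("*.ck") and exception ("!www.ck") rules
        if rule.startswith(("*", "!")):
            continue
        suffixes.append(rule.lower())
    return suffixes


def parse_suffix_list(path):
    """Read a suffix list file and return its plain suffixes as a list."""
    with open(path, encoding="utf-8") as f:
        return parse_suffix_lines(f)
```

Splitting the line handling from the file handling keeps the parsing easy to try out on a few hand-written lines before pointing it at the real file.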
Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:
def is_domain(d):
    for suffix in suffixes:
        # Require the dot, so "example.com" matches but "examplecom" does not
        if d.endswith("." + suffix):
            # Get the base domain name without the suffix and its leading dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain.
            if base_name and "." not in base_name:
                return True
    # If we get here, no matches were found
    return False
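To tie this back to filtering a list, here is a self-contained sketch of an alternative that avoids looping over every suffix. Since the pattern is some-name.suffix with no dot in some-name, everything after the first dot must itself be a known suffix, which is a single set lookup. The names SAMPLE_SUFFIXES and is_first_class_domain are hypothetical, and the tiny suffix set stands in for the real parsed list. The extra check that the name is not itself a suffix is my assumption about the desired behaviour (otherwise "co.uk" would pass because "uk" is a suffix).

```python
# Hypothetical sample data; in practice, build this set from the parsed suffix list
SAMPLE_SUFFIXES = {"com", "uk", "co.uk"}


def is_first_class_domain(d, suffixes=SAMPLE_SUFFIXES):
    """Return True if d looks like name.suffix with no dot inside name."""
    d = d.lower()
    # A bare suffix such as "co.uk" is not a registrable domain
    if d in suffixes:
        return False
    # Split at the first dot: "amazon.co.uk" -> ("amazon", ".", "co.uk")
    name, dot, rest = d.partition(".")
    return bool(name) and bool(dot) and rest in suffixes


domains = ["google.com", "amazon.co.uk", "www.google.com",
           "mail.amazon.co.uk", "co.uk", "localhost"]
first_class = [d for d in domains if is_first_class_domain(d)]
print(first_class)  # ['google.com', 'amazon.co.uk']
```

With the full suffix list holding thousands of entries, the set lookup should be noticeably cheaper per domain than scanning the whole list, though for a handful of domains either approach is fine.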