How to extract top-level domain name (TLD) from URL

How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (top-level domains) or country codes (because they change)?
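To make the failure concrete, here is the same attempt spelled for Python 3 (`urllib.parse` instead of the Python 2 `urlparse` module):

```python
from urllib.parse import urlparse

def naive_domain(url):
    # keep only the last two dot-separated labels of the host
    return '.'.join(urlparse(url).netloc.split('.')[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com (correct)
print(naive_domain("http://www.foo.com.au"))  # com.au (wrong: should be foo.com.au)
```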

Thanks.

Cracksman answered 1/7, 2009 at 1:42 Comment(3)
A related question previously on Stack Overflow: #569637Lactoscope
+1: The "simplistic attempt" in this question works well for me, even if it ironically didn't work for the author.Toffeenosed
Similar question: #14406800Vertebral
56

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only ones like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).

Lacagnia answered 1/7, 2009 at 1:48 Comment(0)
68

Here's a great python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

Animate answered 12/9, 2011 at 13:46 Comment(2)
This worked for me where tld failed (it marked a valid URL as invalid).Tazza
Lost too much time thinking about the problem, should have known and used this from the start.Guerra
42

Using this file of effective TLDs, which someone else found on Mozilla's website:

from urllib.parse import urlparse

# load TLD rules, ignoring comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url).netloc.split('.')
    # url_elements = ["abcde", "co", "uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde", "co", "uk"]
        #    i=-2: ["co", "uk"]
        #    i=-1: ["uk"] etc.

        candidate = ".".join(last_i_elements)  # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i - 1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

results in:

abcde.co.uk

I'd appreciate it if someone could let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?

Overstudy answered 1/7, 2009 at 15:23 Comment(4)
If you need to call getDomain() often in practice, such as extracting domains from a large log file, I would recommend that you make tlds a set, e.g. tlds = set([line.strip() for line in tldFile if line[0] not in "/\n"]). This gives you constant time lookup for each of those checks for whether some item is in tlds. I saw a speedup of about 1500 times for the lookups (set vs. list) and for my entire operation extracting domains from a ~20 million line log file, about a 60 times speedup (6 minutes down from 6 hours).Beilul
This is awesome! Just one more question: is that effective_tld_names.dat file also updated for new domains such as .amsterdam, .vodka and .wtf?Belter
The Mozilla public suffix list gets regular maintenance, yes, and now has multiple Python libraries which include it. See publicsuffix.org and the other answers on this page.Theology
Some updates to get this right in 2021: the file is now called public_suffix_list.dat, and Python will complain if you don't specify that it should read the file as UTF8. Specify the encoding explicitly: with open("public_suffix_list.dat", encoding="utf8") as tld_fileMarshmallow
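Folding those comments back in, here is a self-contained Python 3 sketch of the same algorithm, using a set for constant-time lookups; the inline `tlds` is a tiny sample of rules standing in for the real public_suffix_list.dat:

```python
from urllib.parse import urlparse

# Tiny sample of suffix rules; in practice, load the real file instead:
# with open("public_suffix_list.dat", encoding="utf8") as f:
#     tlds = {line.strip() for line in f if line[0] not in "/\n"}
tlds = {"com", "uk", "co.uk", "au", "com.au", "*.ck", "!www.ck"}

def get_domain(url, tlds):
    url_elements = urlparse(url).netloc.split('.')
    # try progressively longer suffixes of the host, right to left
    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        candidate = ".".join(last_i_elements)
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])
        exception_candidate = "!" + candidate
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i - 1:])
    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))     # abcde.co.uk
print(get_domain("http://www.foo.com.au", tlds))  # foo.com.au
```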
42

Using python tld

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as string from the URL given

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

or without protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first level domain name as string from the URL given

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'
Sogdian answered 16/5, 2013 at 6:46 Comment(8)
This will become more unreliable with the new gTLDs.Salsbury
Hey, thanks for pointing at this. I guess, when it comes to the point that new gTLDs are actually being used, a proper fix could come into the tld package.Sogdian
Thank you @ArturBarseghyan! It's very easy to use with Python. But I am using it now for an enterprise-grade product; is it a good idea to continue using it even if gTLDs are not being supported? If yes, when do you think gTLDs will be supported? Thank you again.Praenomen
@Akshay Patil: As stated above, when it comes to the point that gTLDs are intensively used, a proper fix (if possible) would arrive in the package. In the meanwhile, if you're concerned much about gTLDs, you can always catch the tld.exceptions.TldDomainNotFound exception and proceed anyway with whatever you were doing, even if domain hasn't been found.Sogdian
Is it just me, or does tld.get_tld() actually return a fully qualified domain name, not a top level domain?Finned
get_tld("http://www.google.co.uk", as_object=True).extension would print out: "co.uk"Sogdian
Having URL parsing functionality built in is nice, I suppose, but requiring input to be a URL seems misdirected. If I want to handle host names for SSH or whatever, forcing them to be URLs (or "accepting" that the protocol is "missing") is just weird.Theology
triplee: It works without protocol as well. see the updated example.Sogdian
2

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list

http://www.iana.org/domains/root/db/

Cosma answered 1/7, 2009 at 1:51 Comment(2)
That doesn't help, because it doesn't tell you which ones have an "extra level", like co.uk.Pincas
Lennart: It helps, U can wrap them to be optional, within a regex.Rebato
0

Until get_tld is updated for all the new ones, I pull the TLD from the error. Sure, it's bad code, but it works.

import re
import tld

def get_tld(self):
  try:
    # call the library as tld.get_tld so this method doesn't
    # shadow it and recurse into itself
    return tld.get_tld(self.content_url)
  except Exception as e:
    re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
    match = re_domain.search(str(e))
    if match:
      return match.group(1)
    raise
Watercool answered 8/4, 2015 at 21:36 Comment(0)
-1

Here's how I handle it:

import re
import sys
from urllib.parse import urlparse

if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url).netloc
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)
Gelatinize answered 19/3, 2013 at 18:53 Comment(1)
There is a domain called .travel. It won't work with the above code.Benioff
-1

In Python, I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='www.mybrand', domain='sa', suffix='com'!

So finally, I decided to write this method.

IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full URL required, with subdomain: " + url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
Flow answered 28/5, 2019 at 16:45 Comment(0)
