Python urlparse -- extract domain name without subdomain

I need a way to extract the domain name, without the subdomain, from a URL using Python urlparse.

For example, I would like to extract "google.com" from a full URL like "http://www.google.com".

The closest I can seem to get with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
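For illustration, this is the behavior I'm seeing (shown with the Python 3 import; on Python 2 the module itself is urlparse):

>>> from urllib.parse import urlparse
>>> urlparse("http://www.google.com").netloc
'www.google.com'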

I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regexes in this task. (The reason for this is that I am not familiar enough with URL formation rules to feel confident I could cover every edge case required in writing a custom parsing function.)

Or, if urlparse can't do what I need, does anyone know of any other Python URL-parsing libraries that would?

Tyrannous asked 18/1, 2013 at 19:33 Comment(13)
When you say remove www, does that mean all subdomains, or just that particular one?Celia
related : #1067433Eagan
@Lattyware -- good question, sorry I did not make that more clear. I edited the question to reflect the answer.Tyrannous
So for google.co.uk, you want to get rid of google?!?Vesper
@Anony-Mousse, no, I would like google.co.uk from www.google.co.uk. I'm sorry this was not worded very clearly the first time around and I edited it again to try to make it clearer.Tyrannous
So maybe, only remove www. if the domain starts with that? No need for a library to do that.Vesper
@Anony-Mousse I think he wants to remove everything but the base domain and the tld.Celia
@Lattyware yes you are rightTyrannous
@ClayWardell: Your two comments here seem inconsistent. Removing "everything but the base domain and the tld" means that for www.google.co.uk you remove everything but co.uk. But above you said you wanted google.co.uk. So, which is it?Gravois
@ClayWardell: I suspect you haven't thought through the fact that what you think of as the "site name" ("Google") is sometimes the 2LD (www.google.com), sometimes the 3LD (www.google.co.uk), sometimes even deeper (www.clay.wardell.co.uk), or even ambiguous (in www.mail.yahoo.co.uk do you want just yahoo.co.uk or mail.yahoo.co.uk?). You need to define the actual heuristic algorithm you want before you can ask how to code it. (Or, alternatively, ask what heuristic algorithms others have already defined so you can look them over.)Gravois
@Gravois with www.google.co.uk, I interpreted the url parts as follows: subdomain: www, base domain: google, tld: co.uk. So the base domain plus the tld would be google.co.uk. I could be wrong but I always thought .co.uk was just a UK version of the American .com.Tyrannous
@ClayWardell: But that's not what "tld" means. It stands for "top-level domain", and ".uk" is the top-level domain. And yes, ".co.uk" is effectively the UK equivalent of the US (or global) ".com"—but that's exactly the point. Things that are at the second level are often equivalent to things that are at the third level, like the "google" in "www.google.com" and in "www.google.co.uk". Or the "joeschmoe" in "www.joeschmoe.com" vs. "www.joeschmoe.freesites.com" (or even "www.joeschmoe.freesites.co.uk").Gravois
Plus, I remember there was discussion about getting rid of the restrictions altogether. Which would enable domains such as windows.microsoft. So what would the tld be then, when this change to DNS comes?Vesper

You probably want to check out tldextract, a library designed to do this kind of thing.

It uses the Public Suffix List to try to get a decent split based on known gTLDs, but do note that this is just a brute-force list, nothing special, so it can get out of date (although hopefully it's curated so as not to).

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

So in your case:

>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
"google.com"
Celia answered 18/1, 2013 at 19:38 Comment(7)
Looks like a good heuristic, nevertheless. I figure that a lot of the time, just stripping off known prefixes (www. etc.) is more useful though.Vesper
@Anony-Mousse Very much depends on the use case.Celia
ExtractResult(subdomain='my.first', domain='last', tld='name') - which is what you would expect.Celia
An interesting quirk of this library is that there is a hidden file in the tldextract folder called .tld_set_snapshot that I needed to paste into my web app, in the same folder as tldextract.py, for it to work; otherwise I got an error that it couldn't find that file. But other than that it seems to work great. Thanks :)Tyrannous
I'd presume that's to cache the public suffix list. Glad to hear it works for you.Celia
tldextract pulls in all of requests which seems a bit excessive. tldextract.extract('www.google.co.uk') gives me multiple SSL warnings (!) but eventually succeeds.Oestrogen
I'd like to draw attention to a serious shortcoming of the tldextract package: there's NO VALIDATION. I'm using it for a small project and I've noticed that tldextract just doesn't care what the string is: tldextract.extract('index.php?page=sign-varen') gives ExtractResult(subdomain='index', domain='php', suffix=''), and tldextract.extract('step1_orderintro.html') gives ExtractResult(subdomain='step1_orderintro', domain='html', suffix='').Halicarnassus

This is an update, based on the bounty request for an updated answer.

Start by using the tld package. A description of the package:

Extracts the top level domain (TLD) from the URL given. List of TLD names is taken from Mozilla http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat?raw=1

from tld import get_tld
from tld.utils import update_tld_names

# Sync the TLD names with the most recent version from Mozilla
update_tld_names()

print(get_tld("http://www.google.co.uk"))
print(get_tld("http://zap.co.it"))
print(get_tld("http://google.com"))
print(get_tld("http://mail.google.com"))
print(get_tld("http://mail.google.co.uk"))
print(get_tld("http://google.co.uk"))

This outputs:

google.co.uk
zap.co.it
google.com
google.com
google.co.uk
google.co.uk

Notice that it correctly handles country-level TLDs by leaving co.uk and co.it intact, and properly removes the www and mail subdomains for both .com and .co.uk.

The update_tld_names() call at the beginning of the script is used to update/sync the TLD names with the most recent version from Mozilla.
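Note for newer versions of the tld package: as a comment below points out, recent releases return just the TLD (e.g. co.uk) from get_tld. On those versions (roughly tld >= 0.9; an assumption worth checking against your installed version), the registered domain shown above comes from get_fld instead:

from tld import get_fld

# On recent tld releases, get_tld("http://mail.google.co.uk") returns "co.uk";
# get_fld ("first-level domain") returns the registered domain instead.
print(get_fld("http://mail.google.co.uk"))  # google.co.uk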

Buatti answered 6/3, 2014 at 14:59 Comment(3)
Is there any particular reason to recommend this over tldextract and/or publicsuffix?Oestrogen
tld.get_tld('www.google.co.uk', fix_protocol=True) fails with "zero length field name in url format" for me.Oestrogen
Not sure if it's a version issue, but on python3.6, get_tld("http://mail.google.co.uk") returns co.uk, and similar.Quickly

There is no standard decomposition of a URL that will give you this.

You cannot rely on the www. prefix being present; in many cases it will not be.

So if you want to assume that only the last two components are relevant (which won't work for UK domains, e.g. www.google.co.uk), then you can do a split('.')[-2:] on the hostname.

Alternatively, and less error-prone, strip a www. prefix when present.

But either way, you cannot assume the www. is always there to strip, because that will NOT work every time!
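To make the trade-off concrete, here is a minimal sketch of both heuristics (the helper name naive_domain is mine, for illustration; Python 3 imports):

from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def naive_domain(url):
    host = urlparse(url).netloc.lower()
    # Heuristic 1: keep only the last two labels -- breaks on e.g. co.uk
    last_two = ".".join(host.split(".")[-2:])
    # Heuristic 2: strip a leading "www." if present -- keeps other subdomains
    stripped = host[len("www."):] if host.startswith("www.") else host
    return last_two, stripped

print(naive_domain("http://www.google.com"))    # ('google.com', 'google.com')
print(naive_domain("http://www.google.co.uk"))  # ('co.uk', 'google.co.uk')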

Below is a list of known suffixes for domains; you can try to keep the suffix plus one more component (a sketch of that follows the link).

https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
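For illustration, a simplified sketch of "suffix + one component" (this ignores the list's wildcard * and exception ! rules, which a real implementation must handle, and fetches the list from its current publicsuffix.org home):

from urllib.parse import urlparse
import urllib.request

PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"

def load_suffixes():
    text = urllib.request.urlopen(PSL_URL).read().decode("utf-8")
    # Keep plain rules only; real code must also honor "*." and "!" entries
    return {line.strip() for line in text.splitlines()
            if line.strip() and not line.startswith(("//", "*", "!"))}

def registered_domain(url, suffixes):
    labels = urlparse(url).netloc.lower().split(".")
    # Walk from the full host toward shorter suffixes (longest match first)
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[max(i - 1, 0):])  # suffix + one label
    return ".".join(labels[-2:])  # fallback when no known suffix matches

print(registered_domain("http://www.google.co.uk", load_suffixes()))  # google.co.uk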

But how do you plan to handle, for example, first.last.name domains? Assume that all users with the same last name belong to the same company? Initially, you could only register third-level domains there; by now, you can apparently register at the second level, too. So for .name there is no general rule.

Vesper answered 18/1, 2013 at 19:36 Comment(1)
+1 for pointing out that there is no actual correct answer, and for also giving the two best heuristics (use—or get from elsewhere—a list of "effective TLDs" and just make a choice for the ambiguous ones, or use a list of "discardable prefixes" and keep everything else).Gravois

For domain name manipulation, you can also use Dnspy (disclaimer: I wrote this library).

It helps extract domains (and domain labels) at various levels, using a fresh copy of the Mozilla Public Suffix List.

Zandrazandt answered 24/2, 2014 at 21:33 Comment(0)

Using tldextract works fine, but it apparently has a problem when parsing blogspot.com subdomains and creates a mess. If you would like to go ahead with that library, make sure to implement an if condition or something similar to avoid an empty string being returned for the subdomain.
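For illustration, a guard along those lines might look like the sketch below (it assumes the quirk comes from blogspot.com being listed in the private section of the Public Suffix List, so the blog name lands in domain and the subdomain comes back empty):

import tldextract

ext = tldextract.extract("http://myblog.blogspot.com")
# The result can look like ExtractResult(subdomain='', domain='myblog',
# suffix='blogspot.com'), so check each part before stitching them together.
if ext.domain and ext.suffix:
    print("{}.{}".format(ext.domain, ext.suffix))
else:
    print("no registered domain found in input")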

Disjunction answered 18/8, 2013 at 19:37 Comment(0)
from tld import get_tld
from tld.utils import update_tld_names
update_tld_names()

result = get_tld('http://www.google.com')
print(result)

Input: http://www.google.com

Result: google.com

Synchrotron answered 21/1, 2015 at 9:57 Comment(0)

There are multiple Python modules which encapsulate the (once Mozilla) Public Suffix List in a library, several of which don't require the input to be a URL. Even though the question asks about URL parsing specifically, my requirement was to handle just domain names, and so I'm offering a tangential answer for that.

The relative merits of publicsuffix2 over publicsuffixlist or publicsuffix are unclear, but they all seem to offer the basic functionality.

publicsuffix2:

>>> import publicsuffix  # sic
>>> publicsuffix.PublicSuffixList().get_public_suffix('www.google.co.uk')
u'google.co.uk'
  • Supposedly more packaging-friendly fork of publicsuffix.

publicsuffixlist:

>>> import publicsuffixlist
>>> publicsuffixlist.PublicSuffixList().privatesuffix('www.google.co.uk')
'google.co.uk'
  • Advertises IDNA support, which I have not tested, however.

publicsuffix:

>>> import publicsuffix
>>> publicsuffix.PublicSuffixList(publicsuffix.fetch()).get_public_suffix('www.google.co.uk')
'google.co.uk'
  • The requirement to handle updates and caching of the downloaded file yourself is a bit of a complication (one way to do that is sketched below).
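For illustration, one way to fetch and cache the list yourself with this package (the cache filename is my own choice):

import os
import publicsuffix

CACHE = "public_suffix_list.dat"  # hypothetical local cache path

# Download the list once and reuse the cached copy on later runs
if not os.path.exists(CACHE):
    with open(CACHE, "w", encoding="utf-8") as f:
        f.write(publicsuffix.fetch().read())

with open(CACHE, encoding="utf-8") as f:
    psl = publicsuffix.PublicSuffixList(f)

print(psl.get_public_suffix("www.google.co.uk"))  # google.co.uk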
Oestrogen answered 29/3, 2017 at 10:56 Comment(0)
