Need a way to extract a domain name without the subdomain from a url using Python urlparse.
For example, I would like to extract "google.com" from a full url like "http://www.google.com".
The closest I can seem to come with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
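To make the problem concrete, here is a minimal sketch of what netloc returns for the example url. It assumes Python 3, where urlparse lives in urllib.parse (in Python 2 it was the urlparse module); the variable names are only for illustration.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

url = "http://www.google.com"

# netloc is the whole network location, subdomain included
netloc = urlparse(url).netloc
print(netloc)  # www.google.com
```

So urlparse splits the url into its parts, but it has no notion of where the subdomain ends and the registered domain begins.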
I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex in this task. (The reason for this is that I am not familiar enough with url formation rules to feel confident that I could consider every edge case required in writing a custom parsing function.)
Or, if urlparse can't do what I need, does anyone know any other Python url-parsing libraries that would?
… google.co.uk, you want to get rid of google?!? – Vesper
… www. if the domain starts with that? No need for a library to do that. – Vesper
… www.google.co.uk you remove everything but co.uk. But above you said you wanted google.co.uk. So, which is it? – Gravois
… www.google.com), sometimes the 3LD (www.google.co.uk), sometimes even deeper (www.clay.wardell.co.uk), or even ambiguous (in www.mail.yahoo.co.uk do you want just yahoo.co.uk or mail.yahoo.co.uk?). You need to define the actual heuristic algorithm you want before you can ask how to code it. (Or, alternatively, ask what heuristic algorithms others have already defined so you can look them over.) – Gravois
… windows.microsoft. So what would the tld be then, when this change to DNS comes? – Vesper