How can I remove 'www.' from a URL using urllib.parse in Python?

Asked 24/7, 2021 at 5:24 Answered 30/9, 2024 at 18:15

Original URL ▶ https://www.exeam.org/index.html

I want to extract exeam.org/ or exeam.org from the original URL.

To do this, I used urllib, the most powerful URL parser in Python that I know of, but unfortunately its parsed fields (url.scheme, url.netloc, ...) did not give me the format I wanted.

Ineducable answered 24/7, 2021 at 5:24 Comment(2)

'.'.join(urlparse('https://www.exeam.org/index.html').netloc.split('.')[1:]) #44113835 – Blintze 24/7, 2021 at 5:34

what do you mean by not only the original URL of the Inquiry but also the majority? I am sorry not to understand. – Blintze 24/7, 2021 at 5:35

To extract the domain name from a URL using urllib:

from urllib.parse import urlparse
surl = "https://www.exam.org/index.html"
urlparsed = urlparse(surl)
# network location from parsed url
print(urlparsed.netloc)
# ParseResult Object
print(urlparsed)

This gives you www.exam.org. If you are after just the exam.org part, you need to decompose it further to the registered domain. Simple splits could be sufficient, but you could also use a library such as tldextract, which knows how to parse subdomains, suffixes and more:

from tldextract import extract

ext = extract(surl)
print(ext.registered_domain)

This will produce:

exam.org
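
For comparison, the "simple splits" route mentioned above can be sketched with the standard library alone. This is only a minimal sketch, assuming the registered domain is always the last two labels of the host name; that assumption breaks for multi-part suffixes such as .co.uk, which is exactly the case tldextract handles. The function name last_two_labels is made up for illustration.

```python
from urllib.parse import urlparse

def last_two_labels(url):
    # hostname is the lowercased host without port or credentials (None if absent)
    host = urlparse(url).hostname or ""
    # Naive assumption: the registered domain is the last two dot-separated labels
    return ".".join(host.split(".")[-2:])

print(last_two_labels("https://www.exam.org/index.html"))  # exam.org
# Multi-part suffixes show the limit of this approach:
print(last_two_labels("https://www.example.co.uk/"))       # co.uk
```

If the URLs you handle only use single-label suffixes like .org or .com, this may be enough; otherwise a suffix-aware library is the safer choice.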

Fenelia answered 24/7, 2021 at 5:56 Comment(0)

You could use this without any extra library:

from urllib.parse import urlsplit

def domain_name(url):
    domain = urlsplit(url).netloc
    # Slice off only a leading 'www.' so hosts such as 'www.awww.com' stay intact
    return domain[len('www.'):] if domain.startswith('www.') else domain
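
A possible variant of the same idea, as a sketch rather than a drop-in replacement: reading .hostname instead of .netloc gives a lowercased host with any port or credentials removed, so the 'www.' check also covers URLs like https://WWW.Exam.org:8080/. The name strip_www is made up for illustration.

```python
from urllib.parse import urlsplit

def strip_www(url):
    # hostname is lowercased and excludes port and user info; None if the URL has no host
    host = urlsplit(url).hostname or ""
    # Remove only a leading "www.", never one that appears later in the host
    return host[len("www."):] if host.startswith("www.") else host

print(strip_www("https://www.exam.org/index.html"))  # exam.org
print(strip_www("https://WWW.Exam.org:8080/path"))   # exam.org
print(strip_www("https://awww.exam.org/"))           # awww.exam.org
```

Note that this only removes the 'www.' label; other subdomains (for example blog.exam.org) are returned unchanged.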

Xerophthalmia answered 30/9, 2024 at 18:15 Comment(0)
