I am trying to extract the domain names from a list of URLs, just like in
https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can be about anything. A few examples:
m.google.com
=> google
m.docs.google.com
=> google
www.someisotericdomain.innersite.mall.co.uk
=> mall
www.ouruniversity.department.mit.ac.us
=> mit
www.somestrangeurl.shops.relevantdomain.net
=> relevantdomain
www.example.info
=> example
And so on..
The diversity of the domains doesn't allow me to use a regex as shown in "how to get domain name from URL": my script will be running on an enormous amount of URLs from real network traffic, so the regex would have to be enormous to catch all kinds of domains as mentioned.
Unfortunately, my web research didn't turn up any efficient solution.
Does anyone have an idea of how to do this?
Any help will be appreciated!
Thank you
Extract domain name from URL in Python
Use tldextract, a more efficient alternative to urlparse. tldextract accurately separates the gTLD or ccTLD (generic or country-code top-level domain) from the registered domain and subdomains of a URL.
>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
Note: the tldextract library makes an HTTP request upon initial install and creates a cache of the latest TLD data. This can raise a permission error for some remote deployments. See here: github.com/john-kurkowski/tldextract#note-about-caching – Dogma

It seems you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) for that URL, and then extract the netloc. From the netloc you can easily extract the domain name using split.
Thank you for your response. Unfortunately, using urlparse on a URL like m.city.domain.com returned ParseResult(scheme='', netloc='', path='m.city.domain.com', params='', query='', fragment=''), while the expected output was domain. – Controller

Use a valid URL (//m.city.domain.com/), not something like (m.city.domain.com). Nobody can guess what you passed when you removed the slashes. – Grainfield

@Controller urlparse follows RFC 1808 syntax, which requires // before the netloc: docs.python.org/3/library/urllib.parse.html – Only

For extracting the domain from a URL:
from urllib.parse import urlparse
url = "https://mcmap.net/q/527702/-extract-domain-name-from-url-in-python"
domain = urlparse(url).netloc  # 'mcmap.net'
To check whether the domain is in a given list:

if urlparse(url).netloc in ["domain1", "domain2", "domain3"]:
    # do something
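The behaviour discussed above can be seen directly — a small stdlib-only sketch (the hostnames are made up):

```python
from urllib.parse import urlparse

# without a scheme, the whole string is parsed as the path and netloc is empty
print(urlparse('m.city.domain.com').netloc)    # ''
# a leading // (per RFC 1808/3986) marks the start of the netloc
print(urlparse('//m.city.domain.com').netloc)  # 'm.city.domain.com'

# netloc keeps the port (and case); hostname strips the port and lowercases
pu = urlparse('https://Example.COM:8080/path')
print(pu.netloc)    # 'Example.COM:8080'
print(pu.hostname)  # 'example.com'
```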
pu.netloc may include the port. You might want pu.hostname instead (to get the domain without the port). – Only

Simple solution via string splitting:

def domain_name(url):
    return url.split("www.")[-1].split("//")[-1].split(".")[0]

Gets the first part of the domain, not the actual domain. Only works for things like www.google.com – Oversell

Unreliable solution, avoid. – Andersonandert
With regex, you could use something like this:

(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))

https://regex101.com/r/WQXFy6/5

Note that you'll have to watch out for special cases such as co.uk.
import re

def getDomain(url: str) -> str:
    '''
    Return the last two dot-separated labels (e.g. domain.com) from any URL
    '''
    # keep a copy of the original URL text
    clean_url = url
    # take out the port number, if any
    reg = re.findall(':[0-9]+', url)
    if len(reg) > 0:
        url = url.replace(reg[0], '')
    # take out path routes
    if '/' in url:
        url = url.split('/')
        if 'http' in clean_url:
            # with a protocol (http://host/...), the host is the third element
            url = url[2]
        else:
            # without a protocol (host/...), the host is the first element
            url = url[0]
    # select only the last two labels
    url = '.'.join(url.split('.')[-2:])
    return url
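Splitting off the last two labels breaks on multi-label suffixes such as co.uk (mall.co.uk would come back as co.uk). A minimal sketch of suffix-aware extraction — the helper name and the tiny hardcoded suffix set below are illustrative assumptions; a real solution should use the full Public Suffix List (e.g. via tldextract):

```python
# illustrative subset only, NOT the real Public Suffix List
MULTI_LABEL_SUFFIXES = {'co.uk', 'ac.uk', 'ac.us', 'co.jp'}

def registered_label(host: str) -> str:
    '''Return the label just left of the (possibly multi-label) suffix.'''
    labels = host.lower().split('.')
    # if the last two labels form a known multi-label suffix,
    # the registered domain is one label further left
    if len(labels) >= 3 and '.'.join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return labels[-3]
    return labels[-2] if len(labels) >= 2 else labels[0]

print(registered_label('www.someisotericdomain.innersite.mall.co.uk'))  # mall
print(registered_label('m.docs.google.com'))                            # google
```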
If you also need to validate the extracted hosts (here rows is assumed to be a list of log lines whose second whitespace-separated field is a URL or hostname):

from urllib.parse import urlparse
import validators

hostnames = []
counter = 0
errors = 0
for row_orig in rows:
    try:
        # the second field of the line holds the URL/host
        row = row_orig.strip().split(' ')[1].rstrip()
        if len(row) < 5:
            print(f"Empty row {row_orig}")
            errors += 1
            continue
        if row.startswith('http'):
            domain = urlparse(row).netloc  # works for https and http
        else:
            domain = row
        if ':' in domain:
            domain = domain.split(':')[0]  # split off the port
        # finally, validate it
        if validators.domain(domain) or validators.ipv4(domain):
            pass
        else:
            print(f"Invalid domain/IP {domain}. RAW: {row}")
            errors += 1
            continue
        hostnames.append(domain)
        if counter % 10000 == 1:
            print(f"Added {counter}. Errors {errors}")
        counter += 1
    except Exception:
        print(f"Error in extraction: {row_orig}")
        errors += 1
A dependency-free recursive approach, checked against the examples from the question:

tests = {
"m.google.com": 'google',
"m.docs.google.com": 'google',
"www.someisotericdomain.innersite.mall.co.uk": 'mall',
"www.ouruniversity.department.mit.ac.us": 'mit',
"www.somestrangeurl.shops.relevantdomain.net": 'relevantdomain',
"www.example.info": 'example',
"github.com": 'github',
}
def get_domain(url, loop=0, data=None):  # None instead of a mutable default argument
dot_count = url.count('.')
if not dot_count:
raise Exception("Invalid URL")
# basic
if not loop:
if dot_count < 3:
data = {
'main': url.split('.')[0 if dot_count == 1 else 1]
}
# advanced
if not data and '.' in url:
if dot_count > 1:
loop += 1
start = url.find('.')+1
end = url.rfind('.') if dot_count != 2 else None
return get_domain(url[start:end], loop, data)
else:
data ={
'main': url.split('.')[-1]
}
return data
for u, v in tests.items():
    result = get_domain(u)
    print(result)
    assert result['main'] == v, (u, result)
urllib.parse – Avulsion