Extract domain name from URL in Python
I am trying to extract the domain names out of a list of URLs, just like in https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url.
My problem is that the URLs can be about anything. A few examples:
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
And so on.
The diversity of the domains doesn't allow me to use a regex as shown in how to get domain name from URL: my script will be running on an enormous amount of URLs from real network traffic, so the regex would have to be enormous in order to catch all kinds of domains.
Unfortunately, my web research didn't turn up an efficient solution.
Does anyone have an idea of how to do this?
Any help will be appreciated!
Thank you

Controller answered 17/5, 2017 at 10:8 Comment(6)
Can you use an external lib? – Fusillade
Gather a list of top-level domains, split your URL on dots, right-strip the TLD, extract the name. – Cherish
Possible duplicate of how to get domain name from URL – Seersucker
Yes, I can use external libs. It is not a duplicate (I even attached a link to that thread); I couldn't find a satisfying answer there. – Controller
Use urllib.parse – Avulsion
Does this answer your question? Get protocol + host name from URL – Dodds
42

Use tldextract, which goes beyond urlparse: using the Public Suffix List, tldextract accurately separates the gTLD or ccTLD (generic or country-code top-level domain) from the registered domain and subdomains of a URL.

>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
Reservoir answered 17/5, 2017 at 10:40 Comment(1)
Note: the tldextract library makes an HTTP request on first use and creates a cache of the latest TLD data. This can raise a permission error for some remote deployments. See github.com/john-kurkowski/tldextract#note-about-caching – Dogma
5

It seems you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) on that URL and then extract the netloc.

From the netloc you can then extract the domain name with split.
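A minimal sketch of that approach. It assumes the URL carries a scheme (urlparse puts scheme-less strings into .path rather than .netloc, as the comments below point out) and a single-label suffix; multi-part suffixes such as co.uk will be mis-split:

```python
from urllib.parse import urlparse

def naive_domain(url):
    # .hostname is the netloc lower-cased, with any :port removed
    host = urlparse(url).hostname
    # take the second-to-last dot-separated label,
    # e.g. 'google' in m.docs.google.com
    return host.split('.')[-2]

print(naive_domain('http://m.docs.google.com/mail'))  # google
print(naive_domain('https://www.example.info'))       # example
```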

Arellano answered 17/5, 2017 at 10:12 Comment(3)
Thank you for your response. Unfortunately, using urlparse on a URL like m.city.domain.com returned ParseResult(scheme='', netloc='', path='m.city.domain.com', params='', query='', fragment=''), while the expected output was domain – Controller
Use a valid URL (//m.city.domain.com/), not something like m.city.domain.com. Nobody can guess what you passed when you removed the slashes. – Grainfield
@Controller urlparse follows RFC 1808 syntax, which requires // before the net_loc: docs.python.org/3/library/urllib.parse.html – Only
2

To extract the domain from a URL:

from urllib.parse import urlparse

url = "https://mcmap.net/q/527702/-extract-domain-name-from-url-in-python"
domain = urlparse(url).netloc
# domain == "mcmap.net"

To check whether the domain is in a given list:

if urlparse(url).netloc in ["domain1", "domain2", "domain3"]:
    ...  # do something
Mcshane answered 24/1, 2023 at 7:45 Comment(1)
urlparse(url).netloc may include a port. You might want urlparse(url).hostname instead (to get the domain without the port). – Only
1

Simple solution via string splitting (no regex needed):

def domain_name(url):
    return url.split("www.")[-1].split("//")[-1].split(".")[0]
Geraldo answered 20/5, 2020 at 12:3 Comment(2)
Gets the first label of the host, not the actual domain. Only works for things like www.google.com – Oversell
Unreliable solution; avoid. – Andersonandert
0

With regex, you could use something like this:

(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))

https://regex101.com/r/WQXFy6/5

Note that you'll have to watch out for special cases such as co.uk and list each multi-part suffix explicitly in the pattern.
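As a sketch, the pattern can be tried with re.search against the hostnames from the question; the alternation only covers the multi-part suffixes enumerated by hand (here co.uk and ac.us), so any other two-part suffix would need to be added:

```python
import re

PATTERN = re.compile(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))')

for host in ['m.google.com',
             'www.someisotericdomain.innersite.mall.co.uk',
             'www.ouruniversity.department.mit.ac.us']:
    match = PATTERN.search(host)
    # group 1 is the label immediately before the (possibly multi-part) suffix
    print(match.group(1) if match else None)  # google, mall, mit
```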

Intercessor answered 17/5, 2017 at 10:34 Comment(0)
0

Check the replace and split methods.

PS: this only works for simple links like https://youtube.com (output: youtube) and www.user.ru.com (output: user).

def domain_name(url):
    return url.replace("www.", "http://").split("//")[1].split(".")[0]
Mammon answered 31/5, 2022 at 16:8 Comment(0)
0
import re

def getDomain(url: str) -> str:
    '''Return the domain from any URL.'''
    # keep a copy of the original url text
    clean_url = url

    # strip an explicit port such as :8080 (':' followed by digits)
    reg = re.findall(':[0-9]+', url)
    if len(reg) > 0:
        url = url.replace(reg[0], '')

    # split off the path segments
    if '/' in url:
        url = url.split('/')

    # select only the host: with a scheme the split looks like
    # ['http:', '', 'host', ...], so the host is at index 2;
    # without a scheme it is the first element of the split
    if 'http' in clean_url:
        url = url[2]
    elif isinstance(url, list):
        url = url[0]

    # collapse back to a string if no split happened
    url = ''.join(url)

    # keep only the last two dot-separated labels
    url = '.'.join(url.split('.')[-2:])

    return url

Internuncio answered 25/12, 2022 at 9:10 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Levitate
0
from urllib.parse import urlparse
import validators  # third-party: pip install validators

# rows is assumed to be an iterable of lines shaped like "<prefix> <url>"
hostnames = []
counter = 0
errors = 0
for row_orig in rows:
    try:
        row = row_orig.strip().split(' ')[1].rstrip()
        if len(row) < 5:
            print(f"Empty row {row_orig}")
            errors += 1
            continue
        if row.startswith('http'):
            domain = urlparse(row).netloc  # works for https and http
        else:
            domain = row

        if ':' in domain:
            domain = domain.split(':')[0]  # split off the port after clearing the scheme

        # finally, validate it
        if validators.domain(domain):
            pass
        elif validators.ipv4(domain):
            pass
        else:
            print(f"Invalid domain/IP {domain}. RAW: {row}")
            errors += 1
            continue

        hostnames.append(domain)
        if counter % 10000 == 1:
            print(f"Added {counter}. Errors {errors}")
        counter += 1
    except Exception:
        print("Error in extraction")
        errors += 1
Unwish answered 21/1, 2023 at 18:15 Comment(0)
-1
tests = {
  "m.google.com": 'google',
  "m.docs.google.com": 'google',
  "www.someisotericdomain.innersite.mall.co.uk": 'mall',
  "www.ouruniversity.department.mit.ac.us": 'mit',
  "www.somestrangeurl.shops.relevantdomain.net": 'relevantdomain',
  "www.example.info": 'example',
  "github.com": 'github',
}

def get_domain(url, loop=0, data=None):

  dot_count = url.count('.')

  if not dot_count:
    raise Exception("Invalid URL")

  # basic
  if not loop:
    if dot_count < 3:
      data = {
        'main':  url.split('.')[0 if dot_count == 1 else 1]
        }

  # advanced
  if not data and '.' in url:
      if dot_count > 1:
        loop += 1
        start = url.find('.')+1
        end = url.rfind('.') if dot_count != 2 else None
        return get_domain(url[start:end], loop, data)
      else:
        data ={
          'main': url.split('.')[-1]
          }

  return data

for u, v in tests.items():
  assert get_domain(u)['main'] == v, (u, v)
print("all tests passed")
Tupper answered 10/2 at 13:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.