Get protocol + host name from URL
P

16

211

In my Django app, I need to get the host name from the referrer in request.META.get('HTTP_REFERER') along with its protocol so that from URLs like:

  • https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1
  • https://mcmap.net/q/128757/-c-program-works-from-cmd-prompt-but-not-run-separately
  • http://www.example.com
  • https://www.other-domain.example/whatever/blah/blah/?v1=0&v2=blah+blah

I should get:

  • https://docs.google.com/
  • https://mcmap.net/
  • http://www.example.com
  • https://www.other-domain.example/

I looked over other related questions and found urlparse, but that didn't do the trick, since it returns only the host name without the protocol:

>>> urlparse(request.META.get('HTTP_REFERER')).hostname
'docs.google.com'
Pony answered 8/3, 2012 at 23:12 Comment(0)
C
369

You should be able to do it with urlparse (docs: python2, python3):

from urllib.parse import urlparse
# from urlparse import urlparse  # Python 2
parsed_uri = urlparse('https://mcmap.net/q/128757/-c-program-works-from-cmd-prompt-but-not-run-separately')
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
print(result)

# gives
'https://mcmap.net/'
Codicil answered 8/3, 2012 at 23:17 Comment(6)
this answer adds a / to the third example http://www.domain.com, but I think this might be a shortcoming of the question, not of the answer.Tobytobye
@TokenMacGuy: ya, my bad... didn't notice the missing /Pony
I don't think this is a good solution, as netloc is not the domain: try urlparse.urlparse('http://user:pass@example.com:8080') and you will find it includes parts like 'user:pass@' and ':8080'Exponible
The urlparse module is renamed to urllib.parse in Python 3. So, from urllib.parse import urlparseFrangipani
This answers what the author meant to ask, but not what was actually stated. For those looking for domain name and not hostname (as this solution provides) I suggest looking at dm03514's answer that is currently below. Python's urlparse cannot give you domain names. Something that seems an oversight.Yarvis
String operations should be avoided at all costs. Use the built-ins from urllib.parse: https://mcmap.net/q/126042/-get-protocol-host-name-from-urlBeamends
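The concern raised in the comments, that netloc still carries any 'user:pass@' credentials and the port, can be illustrated with a minimal sketch. It assumes the goal is the scheme plus the host, keeping an explicit port but dropping credentials, and that the input is an absolute URL. The helper name base_url is hypothetical and not part of urllib.

```python
from urllib.parse import urlsplit


def base_url(url):
    """Return scheme://host[:port]/ without any user:pass@ credentials.

    Hypothetical helper; assumes `url` is absolute (has a scheme and a host).
    """
    parts = urlsplit(url)
    host = parts.hostname or ""      # lowercased, credentials and port removed
    if ":" in host:                  # IPv6 literal: restore the brackets
        host = "[" + host + "]"
    if parts.port is not None:       # keep an explicit port if one was given
        host = "{}:{}".format(host, parts.port)
    return "{}://{}/".format(parts.scheme, host)


print(base_url("https://user:pass@www.example.com:8080/dir/page.html?q=1"))
# https://www.example.com:8080/
print(base_url("http://www.example.com"))
# http://www.example.com/
```

Whether the port belongs in the result depends on what the base URL is used for; dropping the `parts.port` branch gives the scheme and bare host name only.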
L
97

https://github.com/john-kurkowski/tldextract

This is a more verbose version of urlparse. It detects domains and subdomains for you.

From their documentation:

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')

ExtractResult is a namedtuple, so it's simple to access the parts you want.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> '.'.join(ext[:2]) # rejoin subdomain and domain
'forums.bbc'
Lefthander answered 9/3, 2012 at 1:16 Comment(1)
This is the correct answer for the question as written, how to get the DOMAIN name. The chosen solution provides the HOSTNAME, which I believe is what the author wanted in the first place.Yarvis
T
52

Python3 using urlsplit:

from urllib.parse import urlsplit
url = "https://mcmap.net/q/126042/-get-protocol-host-name-from-url"
base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(url))
print(base_url)
# https://mcmap.net/
Trochee answered 22/12, 2013 at 11:20 Comment(0)
O
35
>>> import urlparse  # Python 2; in Python 3 use: from urllib import parse as urlparse
>>> url = 'https://mcmap.net/q/128757/-c-program-works-from-cmd-prompt-but-not-run-separately'
>>> urlparse.urljoin(url, '/')
'https://mcmap.net/'
Oto answered 25/8, 2015 at 22:2 Comment(2)
For Python 3 the import is from urllib.parse import urlparse.Sarco
The argument doesn't seem intuitive, but it works great as a very simple native solutionBoney
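As to why the '/' argument works: urljoin resolves its second argument as a reference relative to the first, and a reference that starts with '/' replaces the whole path while the scheme and netloc are kept and the query and fragment of the base are dropped. A small sketch, using one of the URLs from the question (the variable names are only illustrative):

```python
from urllib.parse import urljoin

url = "https://www.other-domain.example/whatever/blah/blah/?v1=0&v2=blah+blah"

root = urljoin(url, "/")         # absolute path: replaces the path, drops the query
sibling = urljoin(url, "other")  # relative path: resolved against the base path

print(root)     # https://www.other-domain.example/
print(sibling)  # https://www.other-domain.example/whatever/blah/blah/other
```

Note that the netloc is carried over unchanged, so a port or credentials in the original URL also appear in the result.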
M
26

Pure string operations :):

>>> url = "https://mcmap.net/q/126042/-get-protocol-host-name-from-url"
>>> url.split("//")[-1].split("/")[0].split('?')[0]
'mcmap.net'
>>> url = "http://foo.bar?haha/whatever"
>>> url.split("//")[-1].split("/")[0].split('?')[0]
'foo.bar'

That's all, folks.

Melton answered 13/4, 2016 at 21:26 Comment(2)
Good and simple option, but fails in some cases, e.g. foo.bar?hahaTabescent
@SimonSteinberger :-) How'bout this : url.split("//")[-1].split("/")[0].split('?')[0] :-))Melton
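To make the "fails in some cases" remark concrete, here is a short comparison of the split chain against urllib.parse.urlsplit. The function name host_by_split and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def host_by_split(url):
    # The string-only approach from the answer above
    return url.split("//")[-1].split("/")[0].split("?")[0]


# A port stays attached to the result
print(host_by_split("http://foo.bar:8080/x"))  # foo.bar:8080

# A '//' inside the query string makes the chain pick the wrong host
url = "http://user:pass@foo.bar:8080/path?next=http://other.example/"
print(host_by_split(url))      # other.example
print(urlsplit(url).hostname)  # foo.bar
```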
P
15

The standard library function urllib.parse.urlsplit() is all you need. Here is an example for Python3:

>>> import urllib.parse
>>> o = urllib.parse.urlsplit('https://user:pass@www.example.com:8080/dir/page.html?q1=test&q2=a2#anchor1')
>>> o.scheme
'https'
>>> o.netloc
'user:pass@www.example.com:8080'
>>> o.hostname
'www.example.com'
>>> o.port
8080
>>> o.path
'/dir/page.html'
>>> o.query
'q1=test&q2=a2'
>>> o.fragment
'anchor1'
>>> o.username
'user'
>>> o.password
'pass'
Panjandrum answered 3/3, 2020 at 20:38 Comment(0)
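Since the result of urlsplit() is a named tuple, one possible way to get from those components back to the root URL asked for in the question is to blank out the parts that are not wanted and call geturl(). This is only a sketch; it keeps the netloc as-is, including any port or credentials.

```python
from urllib.parse import urlsplit

url = "https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1"

parts = urlsplit(url)
# Keep scheme and netloc, reset the path to '/', and clear query and fragment
root = parts._replace(path="/", query="", fragment="").geturl()

print(root)  # https://docs.google.com/
```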
A
9

If you know your URL is valid and includes a scheme, this works for simple cases (any port or credentials stay in the result):

domain = "http://google.com".split("://")[1].split("/")[0]
Ailey answered 7/5, 2018 at 12:57 Comment(3)
The last split is wrong, there are no more forward slashes to split.Phyllotaxis
It won't be a problem: if there are no more slashes, split returns a list with one element, so it works whether there is a slash or not.Ailey
I edited your answer to be able to remove the down-vote. Nice explanation. Thanks.Phyllotaxis
G
6

Here is a slightly improved version:

urls = [
    "http://stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "Stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "http://stackoverflow.com/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "https://StackOverflow.com:8080?test=/questions/9626535/get-domain-name-from-url",
    "stackoverflow.com?test=questions&v=get-domain-name-from-url"]
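for url in urls:
    parts = url.split("://")
    i = 1 if len(parts) > 1 else 0  # skip the scheme when one is present
    # cut off the query, the path and the port, then normalise the case
    dm = parts[i].split("?")[0].split('/')[0].split(':')[0].lower()
    print(dm)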

Output

stackoverflow.com
stackoverflow.com
stackoverflow.com
stackoverflow.com
stackoverflow.com

Fiddle: https://pyfiddle.io/fiddle/23e4976e-88d2-4757-993e-532aa41b7bf0/?i=true

Gatehouse answered 23/11, 2017 at 11:2 Comment(3)
IMHO the best solution, because simple and it considers all sorts of rare cases. Thanks!Tabescent
neither simple nor improvedRett
This is not a solution for the question because you do not provide protocol (https:// or http://)Deathbed
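For comparison, the scheme-less inputs handled by this answer can also be fed to urlsplit, which addresses the missing-protocol objection as well. The sketch below rests on two assumptions that are not in the answer: a missing scheme defaults to "http", and a URL counts as scheme-less when "://" does not occur before the query string. The helper name scheme_and_host is hypothetical.

```python
from urllib.parse import urlsplit


def scheme_and_host(url, default_scheme="http"):
    """Hypothetical helper: tolerate URLs written without a scheme."""
    if "://" not in url.split("?", 1)[0]:
        url = "//" + url  # make urlsplit treat the start as a network location
    parts = urlsplit(url, scheme=default_scheme)
    return "{}://{}".format(parts.scheme, parts.hostname)


print(scheme_and_host("Stackoverflow.com:8080/some/folder?test=/questions/9626535/"))
# http://stackoverflow.com
print(scheme_and_host("https://StackOverflow.com:8080?test=/questions/9626535/"))
# https://stackoverflow.com
```

Like the answer above, this drops the port and lowercases the host, because hostname is used rather than netloc.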
T
5

Is there anything wrong with pure string operations?

url = 'https://mcmap.net/q/126042/-get-protocol-host-name-from-url'
parts = url.split('//', 1)
print(parts[0] + '//' + parts[1].split('/', 1)[0])
# https://mcmap.net

If you prefer having a trailing slash appended, extend this script a bit like so:

parts = url.split('//', 1)
base = parts[0] + '//' + parts[1].split('/', 1)[0]
# append '/' only if the original URL has a slash right after the host
print(base + ('/' if url[len(base):len(base) + 1] == '/' else ''))

That can probably be optimized a bit ...

Tabescent answered 7/6, 2013 at 22:6 Comment(1)
it's not wrong but we got a tool that already does the work, let's not reinvent the wheel ;)Pony
W
3

I know it's an old question, but I encountered it today too. Solved this with a one-liner:

import re
result = re.sub(r'(.*://)?([^/?]+).*', r'\g<1>\g<2>', url)
Whydah answered 2/11, 2018 at 3:32 Comment(0)
S
3

It could be solved by re.search()

import re
url = 'https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1'
result = re.search(r'^https?://[\w.-]+', url).group()  # '-' added so hyphenated hosts match
print(result)

#result
'https://docs.google.com'
Saladin answered 5/11, 2019 at 21:49 Comment(2)
This answer was helpful for my case. Thanks.Blancablanch
Does not include portGodly
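If the port should be kept, as the last comment points out, the pattern could be extended with an optional numeric port. This is a rough sketch rather than a full URL grammar: it assumes http or https URLs only and does not support credentials or IPv6 literals. The names ORIGIN_RE and origin are made up for this example.

```python
import re

# scheme, host, and an optional numeric port
ORIGIN_RE = re.compile(r"^(https?)://([^/?#:@]+)(:\d+)?")


def origin(url):
    """Return the matched 'scheme://host[:port]' prefix, or None if no match."""
    match = ORIGIN_RE.match(url)
    return match.group(0) if match else None


print(origin("https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1"))
# https://docs.google.com
print(origin("http://www.other-domain.example:8080/whatever"))
# http://www.other-domain.example:8080
```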
T
2

This is a bit obtuse, but uses urlparse in both directions:

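from urllib.parse import urlparse, urlunparse
# from urlparse import urlparse, urlunparse  # Python 2
def uri2schemehostname(uri):
    # keep scheme and netloc, blank out the remaining four components
    return urlunparse(urlparse(uri)[:2] + ("",) * 4)

That odd ("",) * 4 bit is there because urlunparse expects a sequence of exactly len(ParseResult._fields) = 6 items.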

Tobytobye answered 9/3, 2012 at 1:43 Comment(0)
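The six-field requirement mentioned in the answer above can be checked directly. The following sketch prints the field names and the padded tuple for one of the URLs from the question:

```python
from urllib.parse import ParseResult, urlparse, urlunparse

print(ParseResult._fields)
# ('scheme', 'netloc', 'path', 'params', 'query', 'fragment')

parsed = urlparse("https://www.other-domain.example/whatever/blah/blah/?v1=0&v2=blah+blah")
trimmed = parsed[:2] + ("",) * 4  # scheme and netloc, then four empty strings

print(trimmed)
# ('https', 'www.other-domain.example', '', '', '', '')
print(urlunparse(trimmed))
# https://www.other-domain.example
```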
Z
2

You can simply use urljoin with relative root '/' as second argument:

import urllib.parse


url = 'https://mcmap.net/q/126042/-get-protocol-host-name-from-url'
root_url = urllib.parse.urljoin(url, '/')
print(root_url)
Zug answered 15/7, 2020 at 18:47 Comment(0)
R
2

This is the simple way to get the root URL of any domain.

from urllib.parse import urlparse

url = urlparse('https://stackoverflow.com/questions/9626535/')
root_url = url.scheme + '://' + url.hostname
print(root_url) # https://stackoverflow.com
Rocha answered 21/12, 2021 at 17:42 Comment(0)
A
0

to get domain/hostname and Origin*

url = 'https://mcmap.net/q/126042/-get-protocol-host-name-from-url'
hostname = url.split('/')[2] # mcmap.net
origin = '/'.join(url.split('/')[:3]) # https://mcmap.net

*Origin is used in XMLHttpRequest headers

Aniconic answered 25/2, 2019 at 18:34 Comment(0)
S
-1

If the URL contains fewer than three slashes, everything after '://' is already the host name; otherwise, the host name is the text between '://' and the next slash:

import re

link = 'http://forum.unisoftdev.com/something'

slash_count = link.count('/')
print(slash_count)  # output: 3

if slash_count > 2:
    # capture the text between '://' and the next '/'
    pattern = re.compile(r'://(.*?)/')
    hosts = re.findall(pattern, link)

    print(hosts)  # output: ['forum.unisoftdev.com']
Sodomy answered 21/6, 2018 at 20:55 Comment(0)
