This simple code confuses urlparse: it does not extract the hostname and sets it to None instead:
from urllib.parse import urlparse
parsed = urlparse("google.com/foo?bar=8")
print(parsed.hostname)
Am I missing something?
According to https://www.rfc-editor.org/rfc/rfc1738#section-2.1:
Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http").
Using advice given in previous answers, I wrote this helper function, which can be used in place of urllib.parse.urlparse():
#!/usr/bin/env python3
import re
import urllib.parse

def urlparse(address):
    # Prepend a placeholder scheme when the address does not start with one.
    if not re.search(r'^[A-Za-z0-9+.\-]+://', address):
        address = 'tcp://{0}'.format(address)
    return urllib.parse.urlparse(address)

url = urlparse('localhost:1234')
print(url.hostname, url.port)
A previous version of this function called urllib.parse.urlparse(address) first and then prepended the "tcp" scheme if one wasn't found, but that interprets the username as the scheme if you pass it something like "user:pass@localhost:1234".
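To make that pitfall concrete, here is a minimal sketch comparing the two orderings. The names naive_parse and checked_parse are hypothetical, chosen only for this illustration, and the "tcp" placeholder scheme is carried over from the helper above.

```python
import re
import urllib.parse

# Scheme characters allowed by RFC 1738, followed by "://".
SCHEME_RE = re.compile(r'^[A-Za-z0-9+.\-]+://')

def naive_parse(address):
    """Hypothetical: parse first, add a scheme only if none was found."""
    parsed = urllib.parse.urlparse(address)
    if not parsed.scheme:
        parsed = urllib.parse.urlparse('tcp://' + address)
    return parsed

def checked_parse(address):
    """Hypothetical: check for a scheme with a regex before parsing."""
    if not SCHEME_RE.search(address):
        address = 'tcp://' + address
    return urllib.parse.urlparse(address)

address = 'user:pass@localhost:1234'
naive = naive_parse(address)
checked = checked_parse(address)

# The naive ordering mistakes the username for the scheme.
print(naive.scheme, naive.hostname)        # user None
# Checking first keeps the credentials, host, and port intact.
print(checked.scheme, checked.username, checked.hostname, checked.port)  # tcp user localhost 1234
```

Checking for "://" before parsing avoids relying on how urlparse guesses at a scheme when the input contains a colon but no "//".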
google.com/foo?bar=8 is a relative URL, a.k.a. a "path" with a "query". Perhaps you see google.com as a hostname, but it doesn't have to be (and how would Python know?).
URLs consist of a protocol or scheme ('https:', 'ftp:', etc.), host ('//example.com'), path, query, and fragment.
So urlparse is making its best guess, returning an empty scheme and None for the hostname.
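As a small illustration of the point about the host being introduced by "//", the sketch below parses the same string with and without a leading "//". Adding the "//" prefix is an assumption made for this example and is not something the question's code does.

```python
from urllib.parse import urlparse

# Without "//", everything before "?" is treated as a path.
bare = urlparse("google.com/foo?bar=8")
# With a leading "//", the first segment is recognised as the network location.
marked = urlparse("//google.com/foo?bar=8")

print(repr(bare.netloc), bare.path)                # '' google.com/foo
print(marked.hostname, marked.path, marked.query)  # google.com /foo bar=8
```

The scheme is still empty in the second case; only the host is recovered.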
Just to add some further context to Muadh's answer. Look at the output from these two variations using urlparse:
>>> parsed = urlparse("google.com/foo?bar=8")
>>> parsed
ParseResult(scheme='',
netloc='',
path='google.com/foo',
params='',
query='bar=8',
fragment='')
And with the full URL, including the scheme, specified:
>>> parsed = urlparse("http://google.com/foo?bar=8")
>>> parsed
ParseResult(scheme='http',
netloc='google.com',
path='/foo',
params='',
query='bar=8',
fragment='')
For this to work properly you have to include the protocol identifier (http://). This is what worked for me:
parsed = urlparse("https://www.google.com/foo?bar=8")
print(parsed.hostname)
The output was www.google.com, as expected. More can be read about how to use urlparse here.
Hope this helps you out!
urlparse references Python 2. urlparse is imported differently in Python 3.