urlparse fails with simple url

4

8

This simple code confuses urlparse: it does not detect the hostname and sets it to None instead:

from urllib.parse import urlparse
parsed = urlparse("google.com/foo?bar=8")
print(parsed.hostname)

Am I missing something?

Priscella answered 24/5, 2018 at 0:32 Comment(0)
5

According to https://www.rfc-editor.org/rfc/rfc1738#section-2.1:

Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http").

Using advice given in previous answers, I wrote this helper function which can be used in place of urllib.parse.urlparse():

#!/usr/bin/env python3
import re
import urllib.parse

def urlparse(address):
    if not re.search(r'^[A-Za-z0-9+.\-]+://', address):
        address = 'tcp://{0}'.format(address)
    return urllib.parse.urlparse(address)

url = urlparse('localhost:1234')
print(url.hostname, url.port)

A previous version of this function called urllib.parse.urlparse(address), and then prepended the "tcp" scheme if one wasn't found; but this interprets the username as the scheme if you pass it something like "user:pass@localhost:1234".
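To illustrate that pitfall, here is a minimal sketch comparing the two orders of operation on the same input. The variable names `naive` and `fixed` are only for illustration, and the "tcp" scheme is the same arbitrary placeholder used in the helper above.

```python
from urllib.parse import urlparse

address = 'user:pass@localhost:1234'

# Parsing first: everything before the first ":" is taken as the scheme,
# and no network location is recognised at all.
naive = urlparse(address)
print(naive.scheme, naive.hostname)                 # user None

# Prepending a scheme first: userinfo, host and port are all recognised.
fixed = urlparse('tcp://' + address)
print(fixed.username, fixed.hostname, fixed.port)   # user localhost 1234
```

Checking for "scheme://" with a regular expression before parsing avoids having to guess afterwards whether the detected scheme was real.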

Duluth answered 18/10, 2019 at 21:36 Comment(0)
3

google.com/foo?bar=8 is a relative URL, i.e. a "path" with a "query". You may see google.com as a hostname, but it doesn't have to be one (and how would Python know?).

URLs consist of protocol or scheme ('https:', 'ftp:', etc.), host ('//example.com'), path, query, fragment.

So urlparse is making its best guess, returning an empty scheme and an empty netloc, which is why hostname comes back as None.
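A short sketch of the point about the host marker: urlparse only recognises a network location when it is introduced by "//", so a scheme-less string still yields a hostname if it starts with those two slashes. The variable name `relative` is just for illustration.

```python
from urllib.parse import urlparse

# No "//" marker: the whole string up to "?" is treated as a path.
print(urlparse('google.com/foo?bar=8').hostname)    # None

# A leading "//" marks a network location even without a scheme.
relative = urlparse('//google.com/foo?bar=8')
print(relative.hostname, relative.path)             # google.com /foo
```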

Edelmiraedelson answered 24/5, 2018 at 0:50 Comment(0)
2

Just to add some further context to Muadh's answer. Look at the output from these two variations using urlparse:

>>> parsed = urlparse("google.com/foo?bar=8")
>>> parsed
ParseResult(scheme='', 
            netloc='', 
            path='google.com/foo', 
            params='', 
            query='bar=8', 
            fragment='')

And with the scheme specified:

>>> parsed = urlparse("http://google.com/foo?bar=8")
>>> parsed
ParseResult(scheme='http', 
            netloc='google.com', 
            path='/foo', 
            params='', 
            query='bar=8', 
            fragment='')
Halloween answered 24/5, 2018 at 0:47 Comment(0)
0

For this to work properly you have to include the protocol identifier (http://). This is what worked for me:

from urllib.parse import urlparse
parsed = urlparse("https://www.google.com/foo?bar=8")
print(parsed.hostname)

The output from here was: www.google.com (which seems expected). More can be read about how to use urlparse here.

Hope this helps you out!

Astral answered 24/5, 2018 at 0:38 Comment(4)
Yes, I know that works, but what if I have a list that does not specify the scheme? Actually it's not that weird that they don't. - Priscella
Your link to urlparse references Python 2. urlparse is imported differently in Python 3. - Halloween
In this case I would check whether it has a protocol identifier and, if not, add it to the string before parsing. - Astral
@Halloween Ah, thanks for catching that, I fixed the link. - Astral
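
For a list of addresses, the check-and-prepend idea from the comments could look roughly like the sketch below. The helper name `hostnames_from` and the default scheme "http" are assumptions made for illustration; choose whatever default fits your data.

```python
import re
from urllib.parse import urlparse

# Matches a leading "scheme://"; per RFC 3986 a scheme starts with a letter.
_HAS_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*://')

def hostnames_from(addresses, default_scheme='http'):
    """Return each address's hostname, assuming default_scheme when none is given."""
    result = []
    for address in addresses:
        if not _HAS_SCHEME.match(address):
            # No scheme present, so add one before parsing.
            address = '{0}://{1}'.format(default_scheme, address)
        result.append(urlparse(address).hostname)
    return result

hosts = hostnames_from(['google.com/foo?bar=8',
                        'https://www.google.com/foo?bar=8'])
print(hosts)   # ['google.com', 'www.google.com']
```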
