How can I check whether a URL is valid using `urlparse`?

Asked 12/8, 2014 at 8:3 Answered 15/5, 2018 at 14:31

I want to check whether a URL is valid, before I open it to read data.

I was using the function urlparse from the urlparse package:

if urlparse.urlparse(url).netloc:
    # do something like: open and read using urllib2

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it using my browser).

Is there a better way to check if the URL is valid?

Breena answered 12/8, 2014 at 8:3 Comment(6)

prepend http:// to url without it – Karlie 12/8, 2014 at 8:9

@Karlie But I have a lot of links, and I don't know whether each one starts with http:// or not, or whether it is a valid URL at all. I want to write a function that tells me this while avoiding these kinds of mistakes. – Breena 12/8, 2014 at 8:11

If you're going to open it with urllib2 anyway, can't you just open it first and check if the return code equals 200? – Saccharo 12/8, 2014 at 8:12

@Breena In this case, I think a regexp is the best way – Karlie 12/8, 2014 at 8:14

If it's mainly the http:// that's the issue, if(url[:7] != 'http://'):...url = 'http://' + url – Volney 12/8, 2014 at 8:17

using a try/except would be the best way to go – Itself 12/8, 2014 at 8:26

You can check if the url has the scheme:

>>> url = "no.scheme.com/math/12345.png"
>>> parsed_url = urlparse.urlparse(url)
>>> bool(parsed_url.scheme)
False

If the scheme is missing, you can replace it and get a URL that has one:

>>> parsed_url.geturl()
'no.scheme.com/math/12345.png'
>>> parsed_url = parsed_url._replace(scheme="http")
>>> parsed_url.geturl()
'http:///no.scheme.com/math/12345.png'
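The three slashes in that output appear because a string without a scheme is parsed as a path rather than a netloc. One possible way around this, sketched here for Python 3 (`urllib.parse`), is to prepend the scheme as text so the host ends up in `netloc` when the result is parsed again. The helper name `ensure_scheme` and its `default_scheme` parameter are made up for this illustration.

```python
from urllib.parse import urlparse

def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it already has a scheme, else prepend one."""
    if urlparse(url).scheme:
        return url
    # Prepending "scheme://" as text lets the host land in netloc
    # when the result is parsed again.
    return "{}://{}".format(default_scheme, url.lstrip("/"))

fixed = ensure_scheme("no.scheme.com/math/12345.png")
print(fixed)                   # http://no.scheme.com/math/12345.png
print(urlparse(fixed).netloc)  # no.scheme.com
```

This is only a sketch: a string such as host:port with no scheme may be parsed as already having a scheme, so it would be returned unchanged.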

Jemie answered 12/8, 2014 at 8:24 Comment(4)

+1 for the trick with replacing the tuple which I find very elegant (and didn't know about). The only problem here is that the returned url contains three slashes after the scheme as the url with no scheme is interpreted as path instead of netloc. A simple .replace('///', '//') does the trick for me at least. – Swabber 14/7, 2016 at 9:59

You missed import urlparse – Nolde 12/10, 2016 at 7:53

@alexey_efimov, the question already says "I was using the function urlparse from the urlparse package". – Jemie 12/10, 2016 at 12:6

Otherwise, in Python 3 you can simply use import urllib.parse; urllib.parse.urlparse(url, scheme='http') to get the same result. – Gidgetgie 9/8, 2017 at 21:15

TL;DR: You can't, really. Every answer given so far misses one or more cases.

1. String is google.com (invalid since it has no scheme, even though a browser assumes http by default). urlparse will return an empty scheme and netloc, so all([result.scheme, result.netloc, result.path]) works for this case.
2. String is http://google (invalid since .com is missing). urlparse will return only an empty path. Again, all([result.scheme, result.netloc, result.path]) catches this case.
3. String is http://google.com/ (correct). urlparse will populate scheme, netloc and path, so all([result.scheme, result.netloc, result.path]) works fine for this case.
4. String is http://google.com (correct). urlparse will return only an empty path, so all([result.scheme, result.netloc, result.path]) gives a false negative for this case.

So from the above cases you see that the one that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But this works only in cases where the url contains a path (even if that is the / path).
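The cases above can be reproduced with a short script. This is a minimal sketch for Python 3; the wrapper name `has_all_parts` is invented here just to keep the loop readable.

```python
from urllib.parse import urlparse

def has_all_parts(url):
    """True only if scheme, netloc and path are all non-empty."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])

candidates = ["google.com", "http://google", "http://google.com/", "http://google.com"]
for candidate in candidates:
    print(candidate, has_all_parts(candidate))
# google.com False
# http://google False
# http://google.com/ True
# http://google.com False   (the false negative)
```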

Even if you try to enforce a path (i.e. urlparse(urljoin(your_url, "/"))), you will still get a false positive in case 2.

Maybe something more complicated, like:

from urllib.parse import urljoin, urlparse  # Python 3

final_url = urlparse(urljoin(your_url, "/"))
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)

Maybe you also want to skip scheme checking and assume http if there is no scheme. But even this will only get you so far. Although it covers the above cases, it doesn't fully cover cases where a URL contains an IP address instead of a hostname. For such cases you will have to validate that the IP address is well formed. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL to think of even more cases.
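As a rough illustration of the IP case, here is a sketch for Python 3 that uses the standard ipaddress module to check the host, and also catches the ValueError that parsing can raise on malformed IPv6 brackets. The function name `looks_like_url` is hypothetical, and the rules it applies (scheme required, dotted hostname or well-formed IP) are assumptions rather than a complete definition of validity.

```python
import ipaddress
from urllib.parse import urlparse

def looks_like_url(url):
    """Heuristic check: scheme present and host is a dotted name or an IP."""
    try:
        result = urlparse(url)
        host = result.hostname  # host without port or IPv6 brackets
    except ValueError:
        return False  # e.g. unbalanced IPv6 brackets
    if not result.scheme or not host:
        return False
    try:
        ipaddress.ip_address(host)
        return True  # well-formed IPv4 or IPv6 address
    except ValueError:
        pass
    labels = host.split(".")
    if all(label.isdigit() for label in labels):
        return False  # all-numeric host that is not a valid IPv4 address
    # Require at least two non-empty labels, as in the netloc check above.
    return len(labels) > 1 and all(labels)
```

With these rules, http://google.com and http://127.0.0.1:8000/x pass, while google.com, http://google and http://999.1.1.1 are rejected. URLs without a host, such as file:///some_file.txt, are also rejected, which may or may not be what you want.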

Recording answered 15/5, 2018 at 14:31 Comment(1)

urljoin and urlparse end up calling urlsplit which may throw a ValueError if there are brackets (IPv6) in what it thinks is the netloc, so exception handling is necessary too – Myke 5/10, 2021 at 10:12

You can try the function below, which checks the scheme, netloc and path components obtained from parsing the URL. It supports both Python 2 and 3.

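try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse

def url_validator(url):
    """Return True if the parsed URL has a scheme and a path."""
    try:
        result = urlparse(url)
        components = [result.scheme, result.path]
        # netloc is only checked when present, so file:///... URLs pass
        if result.netloc != "":
            components.append(result.netloc)
        return all(components)
    except Exception:
        # urlparse raises ValueError on malformed input such as bad IPv6 brackets
        return False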

Morlee answered 7/12, 2017 at 11:55 Comment(2)

Fails on a valid URL. >>> url_validator("file:///some_file.txt") False – Quotha 4/4, 2023 at 17:50

made minor changes, you can try again – Morlee 6/4, 2023 at 7:18

A URL without a scheme is actually invalid; your browser is just clever enough to assume http:// as the scheme for it. A good solution may be to check whether the URL lacks a scheme (not re.match(r'^[a-zA-Z]+://', url)) and, if so, prepend http:// to it.
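That check could be wrapped up roughly as follows. This is a minimal sketch; the names `SCHEME_RE` and `with_default_scheme` are chosen here for illustration only.

```python
import re

# Same pattern as in the answer: letters followed by "://" at the start.
SCHEME_RE = re.compile(r'^[a-zA-Z]+://')

def with_default_scheme(url, default="http://"):
    """Prepend a default scheme when the URL does not start with one."""
    if not SCHEME_RE.match(url):
        return default + url
    return url

print(with_default_scheme("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"))
# http://upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png
```

Note that the pattern only recognizes schemes followed by ://, so something like mailto:user@example.com would also get http:// prepended.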

Benia answered 12/8, 2014 at 8:13 Comment(0)
