How can I check whether a URL is valid using `urlparse`?
B

4

14

I want to check whether a URL is valid, before I open it to read data.

I was using the function urlparse from the urlparse module:

if not bool(urlparse.urlparse(url).netloc):
 # do something like: open and read using urllib2

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it using my browser).

Is there a better way to check if the URL is valid?
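
To see why the check fails, it may help to look at what the parser returns for the scheme-less string. This is a minimal sketch assuming Python 3, where the function lives in urllib.parse (in Python 2 it is urlparse.urlparse). Without a leading scheme:// or //, nothing is recognised as the host, so the whole string ends up in path.

```python
from urllib.parse import urlparse

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"
parsed = urlparse(url)

# No "scheme://" and no leading "//": scheme and netloc stay empty,
# and the entire string is treated as a relative path.
print(repr(parsed.scheme))
print(repr(parsed.netloc))
print(repr(parsed.path))

# With an explicit scheme, the host is recognised as the netloc.
with_scheme = urlparse("http://" + url)
print(repr(with_scheme.netloc))
```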

Breena answered 12/8, 2014 at 8:3 Comment(6)
Prepend http:// to any URL that lacks it. - Karlie
@Karlie But I have a lot of links, and I don't know whether each one starts with http:// or not, or whether it is a valid URL at all. I want to write a function that tells me this and avoids these kinds of mistakes. - Breena
If you're going to open it with urllib2 anyway, can't you just open it first and check whether the return code equals 200? - Saccharo
@Breena In this case, I think a regexp is the best way. - Karlie
If it's mainly the http:// that's the issue: if url[:7] != 'http://': url = 'http://' + url - Volney
Using a try/except would be the best way to go. - Itself
J
13

You can check whether the URL has a scheme:

>>> import urlparse  # Python 3: from urllib import parse as urlparse
>>> url = "no.scheme.com/math/12345.png"
>>> parsed_url = urlparse.urlparse(url)
>>> bool(parsed_url.scheme)
False

If the scheme is missing, you can set one with _replace. Note that the result has three slashes after the scheme, because the scheme-less URL was parsed as a path rather than a netloc:

>>> parsed_url.geturl()
'no.scheme.com/math/12345.png'
>>> parsed_url = parsed_url._replace(scheme="http")
>>> parsed_url.geturl()
'http:///no.scheme.com/math/12345.png'
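
One way to avoid the three slashes is to re-parse the string with a leading "//", which tells the parser that what follows is the netloc. The sketch below assumes Python 3; ensure_scheme is a hypothetical helper name, not part of the standard library.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default="http"):
    """Hypothetical helper: return url with a default scheme if it has none."""
    parsed = urlparse(url)
    if parsed.scheme:
        # Already has a scheme; leave the URL untouched.
        return url
    # A leading "//" makes urlparse treat the next segment as the netloc
    # instead of folding the host name into the path.
    reparsed = urlparse("//" + url.lstrip("/"), scheme=default)
    return reparsed.geturl()


print(ensure_scheme("no.scheme.com/math/12345.png"))
print(ensure_scheme("https://example.com/a"))
```

Strings of the form host:port/path with no scheme may be misread by the parser (the host can be taken for a scheme), so this only covers the plain host/path case.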
Jemie answered 12/8, 2014 at 8:24 Comment(4)
+1 for the trick of replacing a field of the named tuple, which I find very elegant (and didn't know about). The only problem is that the returned URL contains three slashes after the scheme, because a URL with no scheme is interpreted as a path instead of a netloc. A simple .replace('///', '//') does the trick, for me at least. - Swabber
You missed import urlparse. - Nolde
@alexey_efimov, the question already said "I was using the function urlparse". - Jemie
Alternatively, you can use import urllib.parse; urllib.parse.urlparse(url, scheme='http') to get the same result. - Gidgetgie
R
13

TL;DR: You can't, really. Every answer already given misses one or more cases.

  1. String is google.com (invalid since there is no scheme, even though a browser assumes http by default). urlparse returns an empty scheme and netloc, so all([result.scheme, result.netloc, result.path]) correctly rejects it.
  2. String is http://google (invalid since .com is missing). urlparse returns only an empty path, so all([result.scheme, result.netloc, result.path]) rejects this case too, though only because the path is empty.
  3. String is http://google.com/ (correct). urlparse populates scheme, netloc and path, so all([result.scheme, result.netloc, result.path]) accepts it.
  4. String is http://google.com (correct). urlparse again returns only an empty path, so all([result.scheme, result.netloc, result.path]) gives a false negative.

So from the above cases you see that the one that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But this works only in cases where the url contains a path (even if that is the / path).

Even if you try to enforce a path (i.e. urlparse(urljoin(your_url, "/"))), you will still get a false positive in case 2.

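Maybe something more complicated would work, like:

from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import urljoin, urlparse

final_url = urlparse(urljoin(your_url, "/"))
# Require scheme, netloc and path, plus at least one dot in the netloc
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)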

Maybe you also want to skip the scheme check and assume http when no scheme is given. But even this only gets you so far. Although it covers the cases above, it doesn't fully cover URLs that contain an IP address instead of a hostname; for those you have to validate that the IP address itself is well formed. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL to think of even more cases.
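
As a rough illustration of how these checks could be combined, here is a sketch assuming Python 3 and only http/https URLs. The function name looks_like_http_url and the exact rules (a dotted host name, or a host that parses as an IP address) are assumptions made for this example; it is still a heuristic, not a complete validator.

```python
import ipaddress
from urllib.parse import urlparse


def looks_like_http_url(url):
    """Heuristic sketch: scheme is http(s) and the host is an IP or dotted name."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        # Raised for malformed input such as unbalanced IPv6 brackets.
        return False
    if parsed.scheme not in ("http", "https") or not host:
        return False
    try:
        # Accept hosts that are well-formed IPv4 or IPv6 addresses.
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass
    # Otherwise require at least two non-empty labels, and reject hosts
    # whose last label is numeric (they look like broken IP addresses).
    labels = host.split(".")
    return len(labels) > 1 and all(labels) and not labels[-1].isdigit()


for candidate in ["google.com", "http://google", "http://google.com/",
                  "http://google.com", "http://192.168.0.1/x", "http://[::1"]:
    print(candidate, looks_like_http_url(candidate))
```

Whether a string that passes actually points at a reachable resource can still only be found out by requesting it.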

Recording answered 15/5, 2018 at 14:31 Comment(1)
urljoin and urlparse end up calling urlsplit, which may throw a ValueError if there are brackets (IPv6) in what it thinks is the netloc, so exception handling is necessary too. - Myke
M
5

You can try the function below, which checks the scheme, netloc and path components obtained after parsing the URL. It supports both Python 2 and 3.

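try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse

def url_validator(url):
    """Return True if url has a scheme and a path (netloc is optional)."""
    try:
        result = urlparse(url)
        # Scheme and path are always required.
        components = [result.scheme, result.path]
        # The netloc is only included when present, so URLs such as
        # file:///some_file.txt are not rejected for lacking one.
        if result.netloc != "":
            components.append(result.netloc)
        return all(components)
    except Exception:
        # Malformed URL or non-string input
        return False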
Morlee answered 7/12, 2017 at 11:55 Comment(2)
Fails on a valid URL: >>> url_validator("file:///some_file.txt") False - Quotha
Made minor changes, you can try again. - Morlee
B
1

A URL without a scheme is actually invalid; your browser is just clever enough to suggest http:// as the scheme for it. A good solution may be to check whether the URL lacks a scheme (not re.match(r'^[a-zA-Z]+://', url)) and, if so, prepend http:// to it.
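
A minimal sketch of that idea, using the regular expression from this answer; the helper name prepend_scheme and its default argument are made up for illustration.

```python
import re

# Same pattern as above: one or more ASCII letters followed by "://".
SCHEME_RE = re.compile(r'^[a-zA-Z]+://')


def prepend_scheme(url, default="http://"):
    """Hypothetical helper: add a default scheme when the URL has none."""
    if not SCHEME_RE.match(url):
        return default + url
    return url


print(prepend_scheme("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"))
print(prepend_scheme("https://example.com"))
```

This only normalises the string; it says nothing about whether the rest of the URL is well formed.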

Benia answered 12/8, 2014 at 8:13 Comment(0)
