Validating URLs in Python

I've been trying to figure out what the best way to validate a URL is (specifically in Python) but haven't really been able to find an answer. It seems like there isn't one known way to validate a URL, and it depends on what URLs you think you may need to validate. As well, I found it difficult to find an easy to read standard for URL structure. I did find the RFCs 3986 and 3987, but they contain much more than just how it is structured.

Am I missing something, or is there no one standard way to validate a URL?

Coussoule answered 6/3, 2014 at 23:4 Comment(2)
What are you asking? Do you want to know whether a domain is in a correct format? Where is your code? – Leg
Possible duplicate of How do you validate a URL with a regular expression in Python? – Ness
27

This looks like it might be a duplicate of How do you validate a URL with a regular expression in Python?

You should be able to use the urlparse function from the urllib.parse module described there.

>>> from urllib.parse import urlparse # python2: from urlparse import urlparse
>>> urlparse('actually not a url')
ParseResult(scheme='', netloc='', path='actually not a url', params='', query='', fragment='')
>>> urlparse('http://google.com')
ParseResult(scheme='http', netloc='google.com', path='', params='', query='', fragment='')

Call urlparse on the string you want to check, then make sure the resulting ParseResult has non-empty scheme and netloc attributes.
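A minimal sketch of that check (the helper name is mine, not from the linked answer):

```python
from urllib.parse import urlparse

def is_probable_url(candidate):
    """Heuristic URL check: require both a non-empty scheme and netloc."""
    parsed = urlparse(candidate)
    return bool(parsed.scheme and parsed.netloc)

print(is_probable_url('http://google.com'))     # True
print(is_probable_url('actually not a url'))    # False
```

Note that this only checks structure, not reachability; as the comment thread below points out, a typo like http://examplecom would still pass.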

Hi answered 6/3, 2014 at 23:12 Comment(7)
You might want to use rfc3987 (pypi.python.org/pypi/rfc3987) or do more processing on the urlparse result. urlparse won't actually validate a netloc as an "internet url" -- I got bitten by this too. urlparse('invalidurl') will give you a netloc + scheme. – Shiksa
@JonathanVanasco, python -c "import urlparse; print urlparse.urlparse('invalidurl')" gives ParseResult(scheme='', netloc='', path='invalidurl', params='', query='', fragment=''), so no netloc or scheme. But that does look like a better package for this problem, as it also provides validation. – Hi
Sorry, the formatting screwed up the display and autolinked my original comment. I had intended urlparse.urlparse('http://invalidurl') - notice the scheme was stripped from the original. The urlparse module interprets 'invalidurl' as a hostname for the netloc -- that's a correct interpretation of the general format, but most people don't intend for stuff like that to pass through. I've encountered too many typos like http://example.com -> http://examplecom. If you pass in IP addresses, it doesn't enforce IPv4 or IPv6 either, so it will accept 999.999.999.999.999 too. – Shiksa
It does look like a stricter parser, but rfc3987 lets through both of those cases as well (999.999.999.999.999.999 and http://examplecom). – Hi
In Python 3: import urllib.parse as urlparse – Assembly
@Assembly This should probably be from urllib.parse import urlparse, as the code above imports the whole parse module. – Chihuahua
So "x://a.bc.1" is a valid URL (scheme='x', netloc='a.bc.1') and apple.de is not (scheme='', netloc='')!? Not really practical… – Landau
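The leniency described in this thread is easy to reproduce (Python 3 shown):

```python
from urllib.parse import urlparse

# A typo'd host with no dot still parses as a "valid-looking" URL:
result = urlparse('http://examplecom')
print(result.scheme, result.netloc)  # http examplecom
```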
22

The original question is a bit old, but you might also want to look at the Validator-Collection library I released a few months back. It includes high-performing regex-based validation of URLs for compliance against the RFC standard. Some details:

  • Tested against Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8
  • No dependencies under Python 3.x; one conditional dependency under Python 2.x (a drop-in replacement for Python 2.x's buggy re module)
  • Unit tests that cover 100+ different succeeding/failing URL patterns, including non-standard characters and the like. As close to covering the whole spectrum of the RFC standard as I've been able to find.

It's also very easy to use:

from validator_collection import validators, checkers

checkers.is_url('http://www.stackoverflow.com')
# Returns True

checkers.is_url('not a valid url')
# Returns False

value = validators.url('http://www.stackoverflow.com')
# value set to 'http://www.stackoverflow.com'

value = validators.url('not a valid url')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('https://123.12.34.56:1234')
# value set to 'https://123.12.34.56:1234'

value = validators.url('http://10.0.0.1')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('http://10.0.0.1', allow_special_ips = True)
# value set to 'http://10.0.0.1'

In addition, Validator-Collection includes 60+ other validators, covering IP addresses (IPv4 and IPv6), domains, email addresses, and more, which folks might find useful.

Unthinking answered 22/7, 2018 at 20:48 Comment(4)
This looks like a really nice package. I haven't tried it yet, but it deserves more than 0 upvotes :-). – Suavity
This only works with domain names - it doesn't appear to like IP addresses though. proxy.remote.http: 'XX.XXX.X.XXX:XXXX' is not a url. proxy.remote.https: 'XX.XXX.X.XXX:XXXX' is not a url. – Denna
Not sure I understand what exactly you mean. The value XX.XXX.X.XXX:XXXX will never validate correctly because a) it does not have a valid protocol, and b) the port (:XXXX) is not expressed as a valid port number. If you try to validate http://XX.XXX.X.XXX:1234, that will validate correctly. If you try to validate an IP http://123.165.43.12:1234, that will validate as well. What's the exact issue you're encountering? – Unthinking
Also, a follow-up: there are certain special IP addresses (like the loopback IPs 127.0.0.1 or 0.0.0.0) which are considered special cases by the RFCs for URLs and IP addresses. By default, they will fail validation. However, you can have them be allowed (pass validation) by passing the allow_special_ips = True parameter to the validator function. More details in the documentation. – Unthinking
1

I would use the validators package; its documentation includes installation instructions.

It is just as simple as

import validators
url = 'YOUR URL'
validators.url(url)

It will return True if the URL is valid and a falsy result if not.

Infralapsarian answered 17/7, 2018 at 21:6 Comment(3)
The following fails: print(validators.url("apple.com")) – Production
@Production Because that's not a valid URL. – Gesso
However, I found a case in which validators fails: https:// seekingalpha dot com/article/4353927/track?type=cli....traºnner_utm_.... (eliminating the extra stuff with "..."). The "º" is not detected and validators returns True. In fact, this URL is not valid. – Gesso
1

You can also try validating with urllib.request, by passing the URL to the urlopen function and catching URLError.

from urllib.request import urlopen
from urllib.error import URLError  # URLError lives in urllib.error, not urllib.request

def validate_web_url(url="http://google"):
    try:
        urlopen(url)  # attempts a real network request
        return True
    except URLError:
        return False

This would return False in this case

Euphorbia answered 18/7, 2018 at 10:44 Comment(1)
Would this work when your working machine has no internet connection? – Conventionalism
1
import re

def is_link(url):
    url_regex = r'\b((http|https|ftp):\/\/[a-z0-9-]+(\.[a-z0-9-]+)+([\/?].*)?)\b'
    return bool(re.match(url_regex, url, re.IGNORECASE))
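A self-contained sketch of how this regex behaves on a few inputs (the function is reproduced here so the snippet runs standalone):

```python
import re

def is_link(url):
    url_regex = r'\b((http|https|ftp):\/\/[a-z0-9-]+(\.[a-z0-9-]+)+([\/?].*)?)\b'
    return bool(re.match(url_regex, url, re.IGNORECASE))

print(is_link('https://example.com/path?q=1'))  # True
print(is_link('example.com'))                   # False (no scheme)
print(is_link('http://localhost'))              # False (regex requires a dot in the host)
```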
Humanize answered 18/5 at 4:42 Comment(1)
Thank you for your interest in contributing to the Stack Overflow community. This question already has a few answers—including one that has been extensively validated by the community. Are you certain your approach hasn't been given previously? If so, it would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren't sufficient. Can you kindly edit your answer to offer an explanation? – Millrace
-1

Assuming you are using Python 3, you could use urllib. The code would go something like this:

import urllib.request as req

def foo():
    url = 'http://bar.com'
    request = req.Request(url)
    try:
        response = req.urlopen(request)
        # response.read() returns the page's HTML, which you can search through
        return True
    except Exception:
        # The URL wasn't valid (or the request failed)
        return False

If there is no error on the line "response = ..." then the url is valid.

Geezer answered 6/3, 2014 at 23:26 Comment(2)
This only works if the host has an internet connection, which may not always be true. – Hi
It would be preferable not to need an internet connection to determine whether the URL is valid. Also, I'm using Python 2.7; I should have specified that in the original question. – Coussoule
