How do you validate a URL with a regular expression in Python?
Asked Answered
K

12

107

I'm building an app on Google App Engine. I'm incredibly new to Python and have been beating my head against the following problem for the past 3 days.

I have a class to represent an RSS Feed and in this class I have a method called setUrl. Input to this method is a URL.

I'm trying to use the re python module to validate off of the RFC 3986 Reg-ex (http://www.ietf.org/rfc/rfc3986.txt)

Below is a snipped which should work?

p = re.compile('^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')
m = p.match(url)
if m:
  self.url = url
  return url
Klink answered 6/5, 2009 at 0:40 Comment(3)
Ahhh, Excellent. I didn't even now the urlparse module even existed! Thank you all!Klink
urlparse module is renamed to urllib.parse in Python 3. Check docs.python.org/3.7/library/…Switcheroo
Security note: A regex of this kind is potentially vulnerable to a ReDoS attack. Hence, regex should be avoided here and other solutions used if you can.Indeterminism
N
160

An easy way to parse (and validate) URL's is the urlparse (py2, py3) module.

A regex is too much work.


There's no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.

Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.

For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///" which is the equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.

Bottom Line. Parse it, and look at the pieces to see if they're displeasing in some way.

Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?

These are not RFC-specified validations. These are validations unique to your application.

Nilsanilsen answered 6/5, 2009 at 1:9 Comment(7)
where is the validate() method on urlparse?Wallinga
The question "is it valid?" isn't easy to answer because almost any string is a valid URL. If the result of parsing gives you a netloc or path you don't like, you could call that "invalid".Nilsanilsen
A regex alone might be too much work, but for virtually every use-case, the use of urlparse needs to be augmented with regexes to validate a netloc (or other part) against a particular standard.Engle
Jonathan is right. I agree that validation isn't easy, but urlparse does zero validation. There's a wide range of very invalid strings which urlparse happily pretends to parse. such as "::::::::::" which any browser would reject. Surely there's a better lib out there for checking some of the basic requirements of a URL.Windbreak
I must agree with @Windbreak since even urlparse({}) where the input is not even a string succeeds without a visible errorGrassquit
@ecerulm is that still true? urlparse(5) or urlparse({5:'foo'}) both raise AttributeErrors.Homemaking
While it is true that there are more restrictive application-specific validations that can and should be implemented, there are generic checks that should also be performed as a first layer of defense. For example there are common characters, such as a space (0x20), that are invalid in a URL path as per RFC 3986. Strict validation must check for invalid bytes that haven't been percent-encoded.Echolalia
A
253

Here's the complete regexp to parse a URL.

(?:https?://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)
\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d
+)){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA
-F\d]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d
]{2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d
]{2}))|[;:@&=])*))?)?)|(?:s?ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),
]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:
%[a-fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\
d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(
?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-
_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news
:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;/?:&=])+@(?:
(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:
(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3})))|(?:[a-zA
-Z](?:[a-zA-Z\d]|[_.+-])*)|\*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Z\d](?:
(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA
-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:[a-zA-Z](?:[a-
zA-Z\d]|[_.+-])*)(?:/(?:\d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Z\d$\
-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(
?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-
Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))/?)|(?:gopher://(?
:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z
](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?
:\d+))?)(?:/(?:[a-zA-Z\d$\-_.+!*'(),;/?:@&=]|(?:%[a-fA-F\d]{2}))(?:(?:
(?:[a-zA-Z\d$\-_.+!*'(),;/?:@&=]|(?:%[a-fA-F\d]{2}))*)(?:%09(?:(?:(?:[
a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*)(?:%09(?:(?:[a-zA-
Z\d$\-_.+!*'(),;/?:@&=]|(?:%[a-fA-F\d]{2}))*))?)?)?)?)|(?:wais://(?:(?
:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?
:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d
+))?)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)(?:(?:/(?:(?:[
a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)/(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))*))|\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-
F\d]{2}))|[;:@&=])*))?)|(?:mailto:(?:(?:[a-zA-Z\d$\-_.+!*'(),;/?:@&=]|
(?:%[a-fA-F\d]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-
Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))
|(?:(?:\d+)(?:\.(?:\d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-
zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d
]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:(?:(
?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?
:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:(?:;(?:(?:
(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&])*)=(?:(?:(?:[a-zA
-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&])*)))*)|(?:ldap://(?:(?:(?
:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?
:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d
+))?))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa
\d]))|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(
?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-
fA-F\d]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(
?:(?:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?
:OID|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[
Aa])?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*)
(?:(?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:
(?:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:O
ID|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa
])?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*))(?:(?
:(?:%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Z\d
]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|oid)\.(?:(?:\d+
)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(
?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*))*(?:(?:(?:%0[Aa])?(
?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:\?(?:(?:(?:(?:[a-zA-Z\d$\
-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:,(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%
[a-fA-F\d]{2}))+))*)?)(?:\?(?:base|one|sub)(?:\?(?:((?:[a-zA-Z\d$\-_.+
!*'(),;/?:@&=]|(?:%[a-fA-F\d]{2}))+)))?)?)?)|(?:(?:z39\.50[rs])://(?:(
?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](
?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\
d+))?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:\+(?
:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*(?:\?(?:(?:[a-zA-Z\d
$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Z\d$\-_.+!*
'(),]|(?:%[a-fA-F\d]{2}))+))?(?:;rs=(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[
a-fA-F\d]{2}))+)(?:\+(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+
))*)?))|(?:cid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?
:@&=])*))|(?:mid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[
;?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:
@&=])*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-
zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)
(?:\.(?:\d+)){3}))(?::(?:\d+))?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
:%[a-fA-F\d]{2}))|[/?:@&=])*)(?:(?:;(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
:%[a-fA-F\d]{2}))|[/?:@&])*)=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA
-F\d]{2}))|[/?:@&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\
-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:\*|
(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+))))?)|(?:(
?:;[Aa][Uu][Tt][Hh]=(?:\*|(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\
d]{2}))|[&=~])+)))(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}
))|[&=~])+))?))@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z
\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\
.(?:\d+)){3}))(?::(?:\d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]
|(?:%[a-fA-F\d]{2}))|[&=~:@/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][
Tt]|[Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[&=~:@/])+)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}
))|[&=~:@/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?
:[1-9]\d*)))?)|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|
[&=~:@/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9
]\d*)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]\d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo
][Nn]=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~:@/])+))
)?)))?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a
-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+
)(?:\.(?:\d+)){3}))(?::(?:\d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Z\d\$\-_.
!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\-_.!~*
'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA
-Z\d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z\
d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*))*)?))|(?:(?:(?:(?:(?:[a
-zA-Z\d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA
-Z\d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:@&=+])*))*)?)))

Given its complexibility, I think you should go the urlparse way.

For completeness, here's the pseudo-BNF of the above regex (as a documentation):

; The generic form of a URL is:

genericurl     = scheme ":" schemepart

; Specific predefined schemes are defined here; new schemes
; may be registered with IANA

url            = httpurl | ftpurl | newsurl |
                 nntpurl | telneturl | gopherurl |
                 waisurl | mailtourl | fileurl |
                 prosperourl | otherurl

; new schemes follow the general syntax
otherurl       = genericurl

; the scheme is in lower case; interpreters should use case-ignore
scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]
schemepart     = *xchar | ip-schemepart


; URL schemeparts for ip based protocols:

ip-schemepart  = "//" login [ "/" urlpath ]

login          = [ user [ ":" password ] "@" ] hostport
hostport       = host [ ":" port ]
host           = hostname | hostnumber
hostname       = *[ domainlabel "." ] toplabel
domainlabel    = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel       = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit     = alpha | digit
hostnumber     = digits "." digits "." digits "." digits
port           = digits
user           = *[ uchar | ";" | "?" | "&" | "=" ]
password       = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath        = *xchar    ; depends on protocol see section 3.1

; The predefined schemes:

; FTP (see also RFC959)

ftpurl         = "ftp://" login [ "/" fpath [ ";type=" ftptype ]]
fpath          = fsegment *[ "/" fsegment ]
fsegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
ftptype        = "A" | "I" | "D" | "a" | "i" | "d"

; FILE

fileurl        = "file://" [ host | "localhost" ] "/" fpath

; HTTP

httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

; GOPHER (see also RFC1436)

gopherurl      = "gopher://" hostport [ / [ gtype [ selector
                 [ "%09" search [ "%09" gopher+_string ] ] ] ] ]
gtype          = xchar
selector       = *xchar
gopher+_string = *xchar

; MAILTO (see also RFC822)

mailtourl      = "mailto:" encoded822addr
encoded822addr = 1*xchar               ; further defined in RFC822

; NEWS (see also RFC1036)

newsurl        = "news:" grouppart
grouppart      = "*" | group | article
group          = alpha *[ alpha | digit | "-" | "." | "+" | "_" ]
article        = 1*[ uchar | ";" | "/" | "?" | ":" | "&" | "=" ] "@" host

; NNTP (see also RFC977)

nntpurl        = "nntp://" hostport "/" group [ "/" digits ]

; TELNET

telneturl      = "telnet://" login [ "/" ]

; WAIS (see also RFC1625)

waisurl        = waisdatabase | waisindex | waisdoc
waisdatabase   = "wais://" hostport "/" database
waisindex      = "wais://" hostport "/" database "?" search
waisdoc        = "wais://" hostport "/" database "/" wtype "/" wpath
database       = *uchar
wtype          = *uchar
wpath          = *uchar

; PROSPERO

prosperourl    = "prospero://" hostport "/" ppath *[ fieldspec ]
ppath          = psegment *[ "/" psegment ]
psegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
fieldspec      = ";" fieldname "=" fieldvalue
fieldname      = *[ uchar | "?" | ":" | "@" | "&" ]
fieldvalue     = *[ uchar | "?" | ":" | "@" | "&" ]

; Miscellaneous definitions

lowalpha       = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
                 "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
                 "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
                 "y" | "z"
hialpha        = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
alpha          = lowalpha | hialpha
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "" | "#" | "%" | 


reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex

unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape
xchar          = unreserved | reserved | escape
digits         = 1*digit
Accidie answered 7/5, 2009 at 15:59 Comment(10)
Was this regular expression generated by script or was it actually written by hand?Legislative
@darkk it compiles fine here. You copy-pasted wrongly, or didn't use a raw string, or something like that.Accidie
What was the original source for this RegExp?Objective
Does not accept https://example.com, sftp://example.com and ftps://example.com. Is this intended?Fedora
i inserted this code for https and it works for me, |(?:https://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*))?)?)Superfluity
@Fedora - found the same. Fixed:Veronaveronese
how to use this in javascriptSuperior
In all of its schnufteness it doesn't seem to consider URL fragments?Teniafuge
@Legislative it was written by hand. Zalgo ain't such a bad guy afterall.Cascio
here is a visual: debuggex.com/r/NmFIiM9vZwKdYEKZ you will have to zoom out a bit to see the whole thing! lolQuetzal
N
160

An easy way to parse (and validate) URL's is the urlparse (py2, py3) module.

A regex is too much work.


There's no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.

Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.

For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///" which is the equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.

Bottom Line. Parse it, and look at the pieces to see if they're displeasing in some way.

Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?

These are not RFC-specified validations. These are validations unique to your application.

Nilsanilsen answered 6/5, 2009 at 1:9 Comment(7)
where is the validate() method on urlparse?Wallinga
The question "is it valid?" isn't easy to answer because almost any string is a valid URL. If the result of parsing gives you a netloc or path you don't like, you could call that "invalid".Nilsanilsen
A regex alone might be too much work, but for virtually every use-case, the use of urlparse needs to be augmented with regexes to validate a netloc (or other part) against a particular standard.Engle
Jonathan is right. I agree that validation isn't easy, but urlparse does zero validation. There's a wide range of very invalid strings which urlparse happily pretends to parse. such as "::::::::::" which any browser would reject. Surely there's a better lib out there for checking some of the basic requirements of a URL.Windbreak
I must agree with @Windbreak since even urlparse({}) where the input is not even a string succeeds without a visible errorGrassquit
@ecerulm is that still true? urlparse(5) or urlparse({5:'foo'}) both raise AttributeErrors.Homemaking
While it is true that there are more restrictive application-specific validations that can and should be implemented, there are generic checks that should also be performed as a first layer of defense. For example there are common characters, such as a space (0x20), that are invalid in a URL path as per RFC 3986. Strict validation must check for invalid bytes that haven't been percent-encoded.Echolalia
E
28

I'm using the one used by Django and it seems to work pretty well:

def is_valid_url(url):
    import re
    regex = re.compile(
        r'^https?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    return url is not None and regex.search(url)

You can always check the latest version here: https://github.com/django/django/blob/master/django/core/validators.py#L74

Epizoic answered 3/11, 2011 at 13:49 Comment(2)
when url is empty string, it will return None instead of FalseFiasco
This will result as autobahn.security as being an invalid domain. While it is perfectly valid.Whatley
B
21

I admit, I find your regular expression totally incomprehensible. I wonder if you could use urlparse instead? Something like:

pieces = urlparse.urlparse(url)
assert all([pieces.scheme, pieces.netloc])
assert set(pieces.netloc) <= set(string.letters + string.digits + '-.')  # and others?
assert pieces.scheme in ['http', 'https', 'ftp']  # etc.

It might be slower, and maybe you'll miss conditions, but it seems (to me) a lot easier to read and debug than a regular expression for URLs.

Bystreet answered 6/5, 2009 at 1:18 Comment(5)
+1 for the codinghorror article on Jeff's attempts to do this. I was going to quote the regex for validating a valid email address, but 4K+ characters don't fit in this box. These things are just hard to do, and the best answer probably is a dedicated parser once you manage to find some candidate text to feed it.Couturier
The urlparse module seems a little liberal for validation (for accepting input and normalising it would be perfect however). It accepts things like "invalidurl--" - which I'm almost certain is an invalid URLs(?)Sectary
@dbr: That's why I added the assert statements to my code sample. Like I said, "maybe you'll miss conditions", but that can happen with regular expressions too, and at least this way you can easily tell what you are and aren't testing for.Bystreet
@dbr: I find it funny to see your comment as SO also accepted it :)Discompose
urlparse module is renamed to urllib.parse in Python 3. Check docs.python.org/3.7/library/…Switcheroo
A
11

Nowadays, in 90% of case if you working with URL in Python you probably use python-requests. Hence the question here - why not reuse URL validation from requests?

from requests.models import PreparedRequest
import requests.exceptions


def check_url(url):
    prepared_request = PreparedRequest()
    try:
        prepared_request.prepare_url(url, None)
        return prepared_request.url
    except requests.exceptions.MissingSchema, e:
        raise SomeException

Features:

  • Don't reinvent the wheel
  • DRY
  • Work offline
  • Minimal resource
Asterism answered 14/12, 2015 at 11:54 Comment(0)
S
9

urlparse quite happily takes invalid URLs, it is more a string string-splitting library than any kind of validator. For example:

from urlparse import urlparse
urlparse('http://----')
# returns: ParseResult(scheme='http', netloc='----', path='', params='', query='', fragment='')

Depending on the situation, this might be fine..

If you mostly trust the data, and just want to verify the protocol is HTTP, then urlparse is perfect.

If you want to make the URL is actually a legal URL, use the ridiculous regex

If you want to make sure it's a real web address,

import urllib
try:
    urllib.urlopen(url)
except IOError:
    print "Not a real URL"
Sectary answered 12/3, 2010 at 9:11 Comment(2)
er, what's wrong with "http://----" ?? It's a perfectly valid url! Just set your hostname to "----" and you can use it already!Accidie
@Accidie Unless I'm misreading, hostnames can't start with hypens, en.wikipedia.org/wiki/…Sectary
G
7

http://pypi.python.org/pypi/rfc3987 gives regular expressions for consistency with the rules in RFC 3986 and RFC 3987 (that is, not with scheme-specific rules).

A regexp for IRI_reference is:

(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[
a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U0002
0000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U
00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009ff
fd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U00
0dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\
\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4]
[0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0
-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]
?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(
?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|
[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F
]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?
:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,
4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0
-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-
9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]
|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|
(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6
}[0-9A-F]{1,4})?::|v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(
?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][
0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\
U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U000500
00-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00
090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd
\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(
?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\uf
dcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\
U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007f
ffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U0
00bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-
F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7
ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000
-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U0007
0000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U
000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000eff
fd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff
\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\
U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U000700
00-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U00
0b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd
])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[\
xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U
00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006ff
fd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U00
0afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-
\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa
0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00
030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd
\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000a
fffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery
>(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U000
1fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\
U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U000900
00-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U00
0d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[\
ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]|/|\\?)*))?(?:\\#(?P<ifra
gment>(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-
\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050
000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U0
0090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfff
d\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|
@)|/|\\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[\xa
0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00
030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd
\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000a
fffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\\[(?:(?:[0-9A-F]{1,
4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-
9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4
}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\
.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1
,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-
4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?
:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A
-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][
0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1
,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]
?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-
9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}
:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|
v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(?:25[0-5]|2[0-4][0-
9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA
-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000
-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U0006
0000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U
000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dff
fd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*)
)?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U0
0010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fff
d\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U000
8fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\
U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*
+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufd
f0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U000400
00-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00
080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd
\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A
-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0
-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000
-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U0008
0000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U
000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F
]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\u
fdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd
\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007
fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U
000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A
-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf
\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00
040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd
\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000b
fffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][
0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9.
_~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U000
2fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\
U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a00
00-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U00
0e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[\ue000-\uf8ff\U000f000
0-\U000ffffd\U00100000-\U0010fffd]|/|\\?)*))?(?:\\#(?P<ifragment>(?:(?:(?:[a-zA-
Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-
\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060
000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U0
00a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfff
d\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\\?)*))?)

In one line:

(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]|/|\\?)*))?(?:\\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]|/|\\?)*))?(?:\\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[\xa0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\U00010000-\U0001fffd\U00020000-\U0002fffd\U00030000-\U0003fffd\U00040000-\U0004fffd\U00050000-\U0005fffd\U00060000-\U0006fffd\U00070000-\U0007fffd\U00080000-\U0008fffd\U00090000-\U0009fffd\U000a0000-\U000afffd\U000b0000-\U000bfffd\U000c0000-\U000cfffd\U000d0000-\U000dfffd\U000e1000-\U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\\?)*))?)
Greenstone answered 23/3, 2012 at 10:31 Comment(0)
A
6

note - Lepl is no longer maintained or supported.

RFC 3696 defines "best practices" for URL validation - http://www.faqs.org/rfcs/rfc3696.html

The latest release of Lepl (a Python parser library) includes an implementation of RFC 3696. You would use it something like:

from lepl.apps.rfc3696 import Email, HttpUrl

# compile the validators (do once at start of program)
valid_email = Email()
valid_http_url = HttpUrl()

# use the validators (as often as you like)
if valid_email(some_email):
    # email is ok
else:
    # email is bad
if valid_http_url(some_url):
    # url is ok
else:
    # url is bad

Although the validators are defined in Lepl, which is a recursive descent parser, they are largely compiled internally to regular expressions. That combines the best of both worlds - a (relatively) easy to read definition that can be checked against RFC 3696 and an efficient implementation. There's a post on my blog showing how this simplifies the parser - http://www.acooke.org/cute/LEPLOptimi0.html

Lepl is available at http://www.acooke.org/lepl and the RFC 3696 module is documented at http://www.acooke.org/lepl/rfc3696.html

This is completely new in this release, so may contain bugs. Please contact me if you have any problems and I will fix them ASAP. Thanks.

Antoinetteanton answered 3/5, 2010 at 13:18 Comment(0)
K
4

The regex provided should match any url of the form http://www.ietf.org/rfc/rfc3986.txt; and does when tested in the python interpreter.

What format have the URLs you've been having trouble parsing had?

Knew answered 6/5, 2009 at 0:55 Comment(0)
H
2

modified django url validation regex:

import re

ul = "\u00a1-\uffff"  # Unicode letters range (must not be a raw string).

# IP patterns
ipv4_re = (
    r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)"
    r"(?:\.(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)){3}"
)
ipv6_re = r"\[[0-9a-f:.]+\]"  # (simple regex, validated later)

# Host patterns
hostname_re = (
    r"[a-z" + ul + r"0-9](?:[a-z" + ul + r"0-9-]{0,61}[a-z" + ul + r"0-9])?"
)
# Max length for domain name labels is 63 characters per RFC 1034 sec. 3.1
domain_re = r"(?:\.(?!-)[a-z" + ul + r"0-9-]{1,63}(?<!-))*"
tld_re = (
    r"\."  # dot
    r"(?!-)"  # can't start with a dash
    r"(?:[a-z" + ul + "-]{2,63}"  # domain label
    r"|xn--[a-z0-9]{1,59})"  # or punycode label
    r"(?<!-)"  # can't end with a dash
    r"\.?"  # may have a trailing dot
)
host_re = "(" + hostname_re + domain_re + tld_re + "|localhost)"

regex = re.compile(
    r"^(?:http|ftp)s?://" # http(s):// or ftp(s)://
    r"(?:[^\s:@/]+(?::[^\s:@/]*)?@)?"  # user:pass authentication
    r"(?:" + ipv4_re + "|" + ipv6_re + "|" + host_re + ")"
    r"(?::[0-9]{1,5})?"  # port
    r"(?:[/?#][^\s]*)?"  # resource path
    r"\Z",
    re.IGNORECASE,
)

source: https://github.com/django/django/blob/master/django/core/validators.py#L74

Hecklau answered 8/3, 2018 at 16:11 Comment(2)
Seems to work great! But:(referring to the title) what’s modified and why?Flea
@HansBouwmeester I just made it portable by using the native re package. Everything else is exactly the same.Hecklau
P
1

I've needed to do this many times over the years and always end up copying someone else's regular expression who has thought about it way more than I want to think about it.

Having said that, there is a regex in the Django forms code which should do the trick:

http://code.djangoproject.com/browser/django/trunk/django/forms/fields.py#L534

Propend answered 6/5, 2009 at 1:23 Comment(1)
The django URL doesn't seem to handle ftp, etc... Also, it doesn't allow all unicode characters in the domain part, which I think are allowed.Money
Y
0
urlfinders = [
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"),
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?"),
    re.compile("(~/|/|\\./)([-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]|\\\\
)+"),
    re.compile("'\\<((mailto:)|)[-A-Za-z0-9\\.]+@[-A-Za-z0-9\\.]+"),
]

NOTE: As ugly as it looks in your browser just copy paste and the formatting should be good

Found at the python mailing lists and used for the gnome-terminal

source: http://mail.python.org/pipermail/python-list/2007-January/595436.html

Yanina answered 6/5, 2009 at 0:58 Comment(4)
That "nttp" protocol seems anomalous, and I see it's in the original source too. I wonder how much this regex was tested?Windsail
Ugh, please don't write code like this. If you just want plain vanilla urls use urlparse as suggested in other answers. Save custom regexes for when you are actually trying to match, say, a specific subset of URLs or something else that could be considered a special case. If you find an ugly regex later you will wonder if you were just trying to describe URLs in general or were specifically picking out the cases the rest of your code could handle because of some weird constraint.Durham
@sth: Please look up the NNTP protocol.Windsail
Um, the bit after a "mailto:" protocol has to be a compliant email address, right? Jeffry Friedl in Mastering Regular Expressions worked out an URL that matches such an animal as 4724 bytes long all by itself....Couturier

© 2022 - 2024 — McMap. All rights reserved.