Properly Matching an IDN URL

I need help building a regular expression that can properly match a URL inside free text. It needs to support:

  • scheme
    • One of the following: ftp, http, https (is ftps a protocol?)
  • optional user (and optional pass)
  • host (with support for IDNs)
    • support for www and sub-domain(s) (with support for IDNs)
    • basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
  • optional port number
  • path (optional, with support for Unicode chars)
  • query (optional, with support for Unicode chars)
  • fragment (optional, with support for Unicode chars)

Here is what I could find out about sub-domains:

A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org comprises a subdomain of the org domain, and en.wikipedia.org comprises a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.

Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:

[0-9a-zA-Z][0-9a-zA-Z\-]{2,62}
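
For an IDN-aware label I'm guessing something along these lines (untested, assuming a PCRE-style engine with the u modifier so that \p{L} and \p{N} work, and disallowing a leading or trailing hyphen):

preg_match('/^[\p{L}\p{N}][\p{L}\p{N}-]{0,61}[\p{L}\p{N}]$/u', 'emilvikström'); // should return 1, assuming a UTF-8 subject string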

Can someone help me out with this regular expression or point me in the right direction?

Methodius answered 29/12, 2009 at 14:39 Comment(2)
With "support for IDNs", do you mean that it should support www.emilvikström.se or just the punycode version www.xn--emilvikstrm-0fb.se? – Gurl
@Emil: emilvikström.se; I believe I would have to use the \p{L} property, but I'm not sure. – Methodius

John Gruber, of Daring Fireball fame, had a post recently that detailed his quest for a good URL-recognizing regex string. What he came up with was this:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

It apparently does OK with Unicode-containing URLs as well. You'd need to make slight modifications to it to capture the rest of what you're looking for -- the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once, have-no-clue-how-to-read-it-ever-again!).
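
In PHP, for example, using it might look roughly like this (an untested sketch; the sample text is made up):

$pattern = '~\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))~iu';
$text = 'See http://example.com/foo (and also www.example.org/bar).';
preg_match_all($pattern, $text, $matches);
// $matches[0] should now be array('http://example.com/foo', 'www.example.org/bar')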

Nonah answered 29/12, 2009 at 15:6 Comment(6)
This is probably good if you add the username and password part (protocol://username:password@host/path?querystring#anchor). – Gurl
I tested this pattern and it works to get the whole URL. Maybe it's easiest to just run the found URLs through parse_url() afterwards. – Gurl
@delfuego: How does that regex differ from this one: (?:[\w-]+://?|www[.])[^\s<>]+(?:[^[:punct:]\s]|/)? – Methodius
Alix, look at the linked Alan Storm piece in my comment for an explanation of each part of John Gruber's regex string, and then you'll see what's missing from yours. – Nonah
@Emil: "This function parse_url is not meant to validate the given URL, it only breaks it up into the above listed parts". And the filter extension fails to validate IDN URLs. – Methodius
@Alix, that's correct. So in this case, the regex handles finding valid URLs, and then the parse_url function breaks the now-validated URLs into their component parts. – Nonah
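
As a rough sketch of that second step (the URL here is just an illustration):

$url = 'http://user:pass@example.com:8080/path/to/page?x=1#top';
$parts = parse_url($url);
// $parts is an array with keys such as scheme, user, pass, host, port, path, query, fragment
echo $parts['host']; // example.com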

If you require the protocol and aren't worried too much about false positives, by far the easiest thing to do is match all non-whitespace characters around "://".
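
A minimal sketch of that idea in PHP (the sample text is made up):

$text = 'see http://example.com/x and ftp://ftp.example.org/file.txt';
preg_match_all('~\S+://\S+~', $text, $matches);
// $matches[0] == array('http://example.com/x', 'ftp://ftp.example.org/file.txt')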

Chilung answered 29/12, 2009 at 14:46 Comment(1)
To eliminate the false ones, run the results through filter_var(), and if that doesn't return false, run them through parse_url() to get the components. – Observe
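
Roughly like this (a sketch only; as noted above, filter_var() will reject non-punycode IDN URLs):

$candidates = array('http://example.com/x', 'not a url');
$valid = array();
foreach ($candidates as $candidate) {
    // FILTER_VALIDATE_URL drops anything that is not a syntactically valid URL
    if (filter_var($candidate, FILTER_VALIDATE_URL) !== false) {
        $valid[] = $candidate;
    }
}
// the survivors can then be split up with parse_url()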

This will get you most of the way there. If you need it more refined, please provide test data.

(ftp|https?)://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?
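
For example (a quick, unvalidated check against a made-up URL):

preg_match('~(ftp|https?)://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?~',
    'docs at http://example.com:8080/docs/index.html?page=2', $m);
// $m[0] should be 'http://example.com:8080/docs/index.html?page=2'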
Strengthen answered 29/12, 2009 at 14:47 Comment(2)
Is that a valid URL? From ietf.org/rfc/rfc1738.txt: "... only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL." – Garett
See RFC 3490 about internationalized domain names. At the technical level (e.g. in DNS) the name is always converted to punycode, but applications display it with the international characters. – Gurl
