Getting parts of a URL (Regex)
N

30

164

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

  1. The Subdomain (test)
  2. The Domain (example.com)
  3. The path without the file (/dir/subdir/)
  4. The file (file.html)
  5. The path with the file (/dir/subdir/file.html)
  6. The URL without the path (http://test.example.com)
  7. (add any other that you think would be useful)

The regex should work correctly even if I enter the following URL:

http://test.example.com/example/example/example.html
Nitriding answered 26/8, 2008 at 11:1 Comment(4)
This is not a direct answer but most web libraries have a function that accomplishes this task. The function is often called something similar to CrackUrl. If such a function exists, use it, it is almost guaranteed to be more reliable and more efficient than any hand-crafted code.Soria
Please explain to us why this needs to be done with a regex. If it's homework, then say that because that's your constraint. Otherwise, there are better language-specific solutions than using a regex.Cyprian
The links to the first and last samples are broken.Tagore
Here you can find how to extract scheme, domain, TLD, port and query path: stackoverflow.com/questions/9760588/…Shf
A
171

A single regex to parse and breakup a full URL including query parameters and anchors e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegEx positions:

url: RegExp['$&'],

protocol:RegExp.$2,

host:RegExp.$3,

path:RegExp.$4,

file:RegExp.$6,

query:RegExp.$7,

hash:RegExp.$8

you could then further parse the host ('.' delimited) quite easily.

What I would do is use something like this:

/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

Then further parse 'the rest' to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
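The two-stage idea can be sketched in Python roughly as follows. The first pattern is the coarse regex quoted above; the second-stage pattern and the names `REST` and `split_url` are assumptions made for illustration, not part of the answer. Note that the greedy `(.*:)` can pick the wrong colon if a later `://` appears in the query string.

```python
import re

# First stage: the coarse split quoted above (protocol, host, port, the rest).
COARSE = re.compile(r'^(.*:)//([A-Za-z0-9\-.]+)(:[0-9]+)?(.*)$')
# Second stage (an assumption, not from the answer): path, query, fragment.
REST = re.compile(r'^([^?#]*)(?:\?([^#]*))?(?:#(.*))?$')


def split_url(url):
    """Return a dict of URL parts, or None if the coarse regex does not match."""
    m = COARSE.match(url)
    if m is None:
        return None
    proto, host, port, rest = m.groups()
    path, query, fragment = REST.match(rest).groups()
    cut = path.rfind('/') + 1  # index just past the last slash (0 if none)
    return {
        'proto': proto.rstrip(':'),
        'host': host,
        'port': port[1:] if port else None,
        'dir': path[:cut],
        'file': path[cut:],
        'query': query,
        'fragment': fragment,
    }


parts = split_url('http://test.example.com/dir/subdir/file.html')
print(parts)
```

Splitting the host on '.' to get the subdomain is deliberately left out here, since that step needs knowledge of multi-part suffixes such as co.uk.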

Aforementioned answered 26/8, 2008 at 11:1 Comment(17)
The link codesnippets.joyent.com/posts/show/523 does not work as of Oct 20 '10Overwork
The problem is this part: (.*)? Since the Kleene star already accepts 0 or more, the ? part (0 or 1) is confusing it. I fixed it by changing (.*)? to (.+)?. You could also just remove the ?Doone
Good catch Bryan. I'm not going to edit the response, since I quoted it from the (now gone) link, but upvoted your comment so that the clarification is more visible.Aforementioned
The listed regex is a very good answer but isnt quite right, its missing one / from the protocol and overmatching the querystring and collecting the hash in element 7... this fixes these two problems - ^((http[s]?|ftp):\/\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*?)?(#[\w\-]+)?$Beatty
Hi Dve, I've improved it a little more to extract example.com from urls like http://www.example.com:8080/.... Here goes: ^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$Bruell
and proof that no regexp is perfect, here's one immediate correction: ^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?)(:\d+)?($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$Bruell
Am I missing something? This doesn't match any URL that doesn't end with a trailing slash - for example www.google.com - that seems like kind of a big problemUnclean
as you can imagine, there's lots of revisions or variations in the regex. mnacos's comments fix some of that.Aforementioned
seems like neither of these work correctly either. For instance, for the host of google.co.uk it returns co.uk. Seems like the only way to do this is with an exhaustive list of all the ccTLDs in the world, including administrative divisions (.com.gt, ac.uk etc) and working backwardsUnclean
try this url=/^(?:(.*?):\/\/?)?\/?(?:[^\/\.]+\.)*?([^\/\.]+)\.?([^\/]*)(?:([^?]*)?(?:\?([^#]*))?)?(.*)?/ url.exec("any_protocol://sbd.domain.foo.gooo/path/to/file.php?a=1&b=2#hash") ------------------------Enlistment
@mnacos, probably you have made a tricky mistake in your comment from Feb 28 '12 at 14:41. When you'll copy your regexp to vim you will notice that it contains unicode characters like <200c><200b> or that has been made intentional.Envelopment
what if I wrote https? instead of http[s]? ?Substantiate
I modified this regex to identify all parts of the URL (improved version), in Python: ^((?P<scheme>[^:/?#]+):(?=//))?(//)?(((?P<login>[^:]+)(?::(?P<password>[^@]+)?)?@)?(?P<host>[^@/?#:]*)(?::(?P<port>\d+)?)?)?(?P<path>[^?#]*)(\?(?P<query>[^#]*))?(#(?P<fragment>.*))? You can see this code in action on pythex.orgSarthe
@DrunkenPoney You didn't say why you edited the regex, nor did you leave a comment explaining.Okwu
The regex does not work if the path is just one character long, e.g. https://example.com/a?foo=bar does not match. I do not understand why this restriction is in place (the relevant part of the regex is ([\w\-\.]+[^#?\s]+))Emboss
I hate to post me-too references, but not everyone can use a python regex and the previously posted regular expressions are not language agnostic or fail miserably with urls like mailto:[email protected] or fully authenticated URLs or that don't have a trailing slash or contain ports like jim:[email protected]:8080. This one handles everything I've been able to throw at it. ^(([^@:\/\s]+):\/?)?\/?(([^@:\/\s]+)(:([^@:\/\s]+))?@)?([^@:\/\s]+)(:(\d+))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]*)?(.*)?(#[\w\-]+)?)?$Middleclass
@Middleclass Here's a version of your regex for use in sed (e.g. for shell scripts). ^\(([^@:\/\s]+):\/?\)?\/?\(([^@:\/\s]+)(:([^@:\/\s]+))?@\)?\([^@:\/\s]+\)\(:(\d+)\)?\(((\/\w+)*\/)([\w\-\.]+[^#?\s]*)?(.*)?(#[\w\-]+)?\)?$ Unfortunately, it doesn't handle the case of https://www.example.com?foo=bar (according to https://mcmap.net/q/18712/-ok-to-skip-slash-before-query-string this is legal)Romona
D
116

I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
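As a rough Python sketch, the RFC regex can be applied unchanged (Python needs no escaping of the forward slashes). The helper `split_authority` is a hypothetical addition, not part of the RFC expression; it splits the authority further into userinfo, host and port.

```python
import re

# The regular expression from RFC 3986, Appendix B, unchanged.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

m = URI_RE.match('http://www.ics.uci.edu/pub/ietf/uri/#Related')
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)


def split_authority(authority):
    """Hypothetical helper (not in the RFC regex): split userinfo, host, port."""
    userinfo, _, hostport = authority.rpartition('@')
    # Either a bracketed IPv6 literal or a colon-free host, then an optional port.
    hp = re.match(r'^(\[[^\]]*\]|[^:]*)(?::(\d*))?$', hostport)
    host, port = hp.group(1, 2) if hp else (hostport, None)
    return userinfo or None, host, port


print(split_authority('www.ics.uci.edu:9000'))
```

Groups that did not take part in the match come back as `None` in Python, which corresponds to `<undefined>` in the listing above.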

Donettedoney answered 5/11, 2014 at 20:22 Comment(6)
great answer! Choosing something from an RFC can surely never be the wrong thing to doEstrous
this does not parse the query parametersTalmud
This is the best one afaict. Specifically this addresses two problems I have seen with the others: 1: This deals correctly with other protocols, such as ftp:// and mailto://. 2: This deals correctly with username and password. These optional fields are separated by a colon, just like hostname and port, and it will trip up most other regexes I have seen. @RémyDAVID The querystring is also not parsed normally by the browser location object. If you need to parse the query string, have a look at my tiny library for that: uqs.Skijoring
This answer deserves more up-votes because it covers pretty much all the protocols.Medievalist
It breaks when the protocol is implied HTTP with a username/password (an esoteric and technically invalid syntax, I admit):, e.g. user:[email protected] - RFC 3986 says: A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative- path reference.Calyptra
This does not separate the domain name from the port as in http://www.ics.uci.edu:9000/pub/ietf/uri/#Related.Nonessential
G
88

I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:

var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/
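Outside the browser, most standard libraries offer the same shortcut. As a minimal sketch of the analogous approach in Python (a swapped-in language, not what the answer uses), `urllib.parse.urlsplit` returns comparable fields; note that `port` comes back as an integer and that `query` and `fragment` omit the leading `?` and `#`.

```python
from urllib.parse import urlsplit

u = urlsplit('http://www.example.com:123/foo/bar.html?fox=trot#foo')
for name in ('scheme', 'netloc', 'hostname', 'port', 'path', 'query', 'fragment'):
    print(name + ':', getattr(u, name))

# Without a scheme or a leading '//', nothing is recognised as a host:
bare = urlsplit('www.example.com/foo')
print(repr(bare.netloc), bare.path)
```

As with the anchor-element trick, input without a scheme is not split into host and path, so it may need a prefix added before parsing.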
Gothurd answered 18/9, 2012 at 4:10 Comment(8)
Given that the original question was tagged "language-agnostic", what language is this?Strepitous
Note that this solution requires a protocol prefix, for example http://, for the protocol, host and hostname properties to be correct. Otherwise the beginning of the URL up to the first slash goes to the protocol property.Cataldo
I believe this, though simple, is much slower than RegEx parsing.Elvinelvina
Is it supported by all browsers?Eventempered
If we're going this way you can also do var url = new URL(someUrl)Direful
@Strepitous the .forEach and console.log imply that it's JavaScript. Unfortunately not language agnostic.Abyssinia
@AUser who cares really? :)Sam
@gman: Unfortunately, the URL() constructor is unimplemented in IE11 and Edge.Marcin
H
35

I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

  1. It cannot handle a port number.
  2. The hash part is broken.

The following is a modified version:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

Position of parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by anon user:

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}
Headrick answered 21/11, 2008 at 16:28 Comment(2)
Beware that it doesn't work if the URL doesn't have a path after the domain -- e.g. http://www.example.com or if the path is a single character like http://www.example.com/a.Nonessential
It also doesn't work for a url with no path but a query string, e.g. example.com?foo=barRomona
T
12

I needed a regular Expression to match all urls and made this one:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all urls, any protocol, even urls like

ftp://user:[email protected]:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

A URL like

mailto://[email protected]

looks like this:

["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined] 
Topsoil answered 15/8, 2012 at 19:56 Comment(1)
If you want to match the whole domain / ip address (not separated by dots) use this one: /(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*))?(?:\:([0-9]*))?\/(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/Antistrophe
O
11

I was trying to solve this in javascript, which should be handled by:

var url = new URL('http://a:[email protected]:890/path/wah@t/foo.js?foo=bar&bingobang=&[email protected]#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

{
  "hash": "#foobar/bing/bo@ng?bang",
  "search": "?foo=bar&bingobang=&[email protected]",
  "pathname": "/path/wah@t/foo.js",
  "port": "890",
  "hostname": "example.com",
  "host": "example.com:890",
  "password": "b",
  "username": "a",
  "protocol": "http:",
  "origin": "http://example.com:890",
  "href": "http://a:[email protected]:890/path/wah@t/foo.js?foo=bar&bingobang=&[email protected]#foobar/bing/bo@ng?bang"
}

However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:

^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.

The parts are in this order:

var keys = [
    "href",                    // http://user:[email protected]:81/directory/file.ext?query=1#anchor
    "origin",                  // http://user:[email protected]:81
    "protocol",                // http:
    "username",                // user
    "password",                // pass
    "host",                    // host.com:81
    "hostname",                // host.com
    "port",                    // 81
    "pathname",                // /directory/file.ext
    "search",                  // ?query=1
    "hash"                     // #anchor
];

There is also a small library which wraps it and provides query params:

https://github.com/sadams/lite-url (also available on bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.

Oversoul answered 2/7, 2014 at 9:16 Comment(2)
This is great but could really do with a version like this that pulls out subdomains instead of the duplicated host, hostname. So if I had http://test1.dev.mydomain.com/ for example it would pull out test1.dev..Cardigan
This works very well. I have been looking for a way to extract unusual auth parameters from urls, and this works beautifully.Consecrate
D
8

I propose a much more readable solution (in Python, but it applies to any regex flavor with named groups):

import re


def url_path_to_dict(path):
    # Each line of the pattern captures one named component of the URL.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    # groupdict() maps each group name to its match, or None if it did not match.
    d = m.groupdict() if m is not None else None

    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

if __name__ == '__main__':
    main()

Prints:

{
'host': 'example.example.com', 
'user': None, 
'path': '/example/example/example.html', 
'query': None, 
'password': None, 
'port': None, 
'schema': 'http'
}
Drawl answered 26/7, 2013 at 23:51 Comment(0)
D
6

subdomain and domain are difficult because the subdomain can have several parts, as can the top level domain, http://sub1.sub2.domain.co.uk/

 the path without the file : http://[^/]+/((?:[^/]+/)*)(?:[^/]+$)?  
 the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$  
 the path with the file : http://[^/]+/(.*)  
 the URL without the path : (http://[^/]+/)  

(Markdown isn't very friendly to regexes)
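A quick way to check patterns like these is to run them against the question's URL. The Python sketch below copies three of the patterns from the list above; deriving the directory part by slicing is an assumption made here for illustration. The captured paths come back without the leading slash, because the patterns consume it before the capture group starts.

```python
import re

URL = 'http://test.example.com/dir/subdir/file.html'

# Patterns copied from the list above; each has exactly one capture group.
PATTERNS = {
    'file': r'http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$',
    'path_with_file': r'http://[^/]+/(.*)',
    'url_without_path': r'(http://[^/]+/)',
}

found = {name: re.match(p, URL).group(1) for name, p in PATTERNS.items()}
# Assumption: the directory part is whatever precedes the file in the full path.
found['path_without_file'] = found['path_with_file'][:-len(found['file'])]
print(found)
```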

Destroy answered 26/8, 2008 at 11:17 Comment(1)
Very useful - I added an additional (http(s?)://[^/]+/) to also grab httpsBueschel
C
5

Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files etc.

I found it from a quick google search:

Link

Connoisseur answered 26/8, 2008 at 11:10 Comment(0)
P
5

This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
   //
   // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
   //
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   //
   // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
   //
   // $@ matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:pwd@host, etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   //
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   //
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
   //
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }
Pomposity answered 16/9, 2010 at 7:21 Comment(0)
E
5
const URI_RE = /^(([^:\/\s]+):\/?\/?([^\/\s@]*@)?([^\/@:]*)?:?(\d+)?)?(\/[^?]*)?(\?([^#]*))?(#[\s\S]*)?$/;
/**
* GROUP 1 ([scheme][authority][host][port])
* GROUP 2 (scheme)
* GROUP 3 (authority)
* GROUP 4 (host)
* GROUP 5 (port)
* GROUP 6 (path)
* GROUP 7 (?query)
* GROUP 8 (query)
* GROUP 9 (fragment)
*/
URI_RE.exec("https://john:[email protected]:123/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("ldap://[2001:db8::7]/c=GB?objectClass?one");
URI_RE.exec("mailto:[email protected]");

Above is a JavaScript implementation with a modified regex.

Evangelize answered 18/4, 2021 at 13:4 Comment(1)
this is amazingQr
T
4
/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).

Toreutic answered 14/1, 2009 at 4:13 Comment(0)
U
2

You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The difficult part is breaking the host into subdomain, domain name and TLD.

There is no standard way to do so, and plain string parsing or a regex cannot produce the correct result in every case. At first I used a regex, but it could not extract the subdomain correctly for every URL. The practical way is to use a list of TLDs: once the TLD of a URL is identified, the label to its left is the domain and whatever remains is the subdomain.

However, the list has to be maintained, since new TLDs can appear. As far as I know, publicsuffix.org currently maintains the latest list, and you can use the domainname-parser tools from Google Code to parse the public suffix list and get the subdomain, domain and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.

This answer is also helpful: Get the subdomain from a URL
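The list-based approach can be sketched in a few lines of Python. The suffix set below is a tiny, hand-picked sample and the function name `split_host` is hypothetical; a real program would load the full list from publicsuffix.org, which also contains wildcard and exception rules that this sketch ignores.

```python
# Hypothetical, hand-picked sample; a real program would load the full,
# regularly updated list published at publicsuffix.org.
SAMPLE_SUFFIXES = {'com', 'org', 'uk', 'co.uk', 'ac.uk', 'com.gt'}


def split_host(host, suffixes=SAMPLE_SUFFIXES):
    """Split a host name into (subdomain, domain, suffix) by longest suffix match."""
    labels = host.lower().split('.')
    # Try the longest candidate suffix first, so 'co.uk' wins over 'uk'.
    for i in range(1, len(labels)):
        suffix = '.'.join(labels[i:])
        if suffix in suffixes:
            domain = labels[i - 1]
            subdomain = '.'.join(labels[:i - 1])
            return subdomain, domain, suffix
    return None  # no known suffix with a label in front of it


print(split_host('sub1.sub2.domain.co.uk'))
print(split_host('test.example.com'))
```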

CaLLMeLaNN

Undoubted answered 9/10, 2009 at 4:39 Comment(0)
S
2

Here is one that is complete, and doesn't rely on any protocol.

function getServerURL(url) {
        var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
        console.log(m[1]) // Remove this
        return m[1];
    }

getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")

Prints

http://dev.test.se

http://dev.test.se

//ajax.googleapis.com

//

www.dev.test.se

www.dev.test.se

www.dev.test.se

www.dev.test.se

//dev.test.se

http://www.dev.test.se

http://localhost:8080

https://localhost:8080
Solicitor answered 27/12, 2012 at 16:17 Comment(0)
M
2

None of the above worked for me. Here's what I ended up using:

/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/
Muricate answered 17/1, 2013 at 18:12 Comment(0)
B
2

I like the regex that was published in "JavaScript: The Good Parts". It's not too short and not too complex. This page on GitHub also has the JavaScript code that uses it, but it can be adapted for any language. https://gist.github.com/voodooGQ/4057330

Beggary answered 31/5, 2015 at 22:0 Comment(0)
N
1

Java offers a URL class that will do this. Query URL Objects.

On a side note, PHP offers parse_url().

Nephrosis answered 26/8, 2008 at 11:55 Comment(3)
It looks like this doesn't parse out the subdomain though?Wheelman
Asker asked for regex. URL class will open a connection when you create it.Klaraklarika
"URL class will open a connection when you create it" - that's incorrect, only when you call methods like connect(). But it's true that java.net.URL is somewhat heavy. For this use case, java.net.URI is better.Cashandcarry
Q
1

I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

Quirt answered 30/11, 2009 at 19:35 Comment(2)
And also very platform specific.Market
I think the point was to use a library, rather than reinvent the wheel. Ruby, Python, Perl have tools to tear apart URLs so grab those instead of implementing a bad pattern.Tagore
D
1

I tried a few of these that didn't cover my needs, especially the highest voted which didn't catch a url without a path (http://example.com/)

also lack of group names made it unusable in ansible (or perhaps my jinja2 skills are lacking).

so this is my version slightly modified with the source being the highest voted version here:

^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$
Difficult answered 23/11, 2016 at 13:53 Comment(0)
P
1

I built this one. It is very permissive: it is not meant to validate a URL, just to split it.

^((http[s]?):\/\/)?([a-zA-Z0-9-.]*)?([\/]?[^?#\n]*)?([?]?[^?#\n]*)?([#]?[^?#\n]*)$

  • match 1 : full protocol with :// (http or https)
  • match 2 : protocol without ://
  • match 3 : host
  • match 4 : slug
  • match 5 : param
  • match 6 : anchor

Works:

http://
https://
www.demo.com
/slug
?foo=bar
#anchor

https://demo.com
https://demo.com/
https://demo.com/slug
https://demo.com/slug/foo
https://demo.com/?foo=bar
https://demo.com/?foo=bar#anchor
https://demo.com/?foo=bar&bar=foo#anchor
https://www.greate-demo.com/

Fails:

#anchor#
?toto?
Paronymous answered 21/10, 2020 at 17:35 Comment(0)
L
1

I needed some REGEX to parse the components of a URL in Java. This is what I'm using:

"^(?:(http[s]?|ftp):/)?/?" +    // METHOD
"([^:^/^?^#\\s]+)" +            // HOSTNAME
"(?::(\\d+))?" +                // PORT
"([^?^#.*]+)?" +                // PATH
"(\\?[^#.]*)?" +                // QUERY
"(#[\\w\\-]+)?$"                // ID

Java Code Snippet:

final Pattern pattern = Pattern.compile(
        "^(?:(http[s]?|ftp):/)?/?" +    // METHOD
        "([^:^/^?^#\\s]+)" +            // HOSTNAME
        "(?::(\\d+))?" +                // PORT
        "([^?^#.*]+)?" +                // PATH
        "(\\?[^#.]*)?" +                // QUERY
        "(#[\\w\\-]+)?$"                // ID
);
final Matcher matcher = pattern.matcher(url);

System.out.println("     URL: " + url);

if (matcher.matches())
{
    System.out.println("  Method: " + matcher.group(1));
    System.out.println("Hostname: " + matcher.group(2));
    System.out.println("    Port: " + matcher.group(3));
    System.out.println("    Path: " + matcher.group(4));
    System.out.println("   Query: " + matcher.group(5));
    System.out.println("      ID: " + matcher.group(6));
    
    return matcher.group(2);
}

System.out.println();
System.out.println();
Landau answered 1/6, 2021 at 21:7 Comment(1)
The host regex fails on the string saas-dev.com. The returned matches are aa and -dev.com. I used RegExr to test.Niles
N
0

Using http://www.fileformat.info/tool/regex.htm hometoast's regex works great.

But here is the deal, I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has its own regex depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)

Nitriding answered 26/8, 2008 at 11:23 Comment(0)
N
0

I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?

If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:

(?:SOMESTUFF)

You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:

https?

would match 'http' or 'https' just fine.
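Both points are easy to check. The Python sketch below uses simplified fragments of the pattern (an assumption made for brevity, not the full expression) to show that `http[s]?` and `https?` behave identically, and that `(?:...)` removes a group from the numbering.

```python
import re

# 'http[s]?' and 'https?' accept exactly the same strings.
for candidate in ('http', 'https', 'httpss', 'ftp'):
    assert bool(re.fullmatch(r'http[s]?', candidate)) == bool(re.fullmatch(r'https?', candidate))

# With (?:...) the scheme alternation is grouped but not captured,
# so the host becomes group 1 instead of group 2.
capturing = re.match(r'^(https?|ftp)://([^:/\s]+)', 'http://test.example.com/dir/')
non_capturing = re.match(r'^(?:https?|ftp)://([^:/\s]+)', 'http://test.example.com/dir/')
print(capturing.groups())
print(non_capturing.groups())
```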

Nikolenikoletta answered 26/8, 2008 at 11:34 Comment(0)
C
0

regexp to get the URL path without the file.

url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s

It can be useful for adding a relative path to this url.

Carrell answered 16/7, 2009 at 22:22 Comment(0)
U
0

The regex to do full parsing is quite horrendous. I've included named backreferences for legibility, and broken each part into separate lines, but it still looks like this:

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

The thing that requires it to be so verbose is that except for the protocol or the port, any of the parts can contain HTML entities, which makes delineation of the fragment quite tricky. So in the last few cases - the host, path, file, querystring, and fragment, we allow either any html entity or any character that isn't a ? or #. The regex for an html entity looks like this:

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
(?:#(?P<fragment>.*))?$

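In JavaScript engines without named capture groups (they arrived in ES2018, written (?<name>...) rather than (?P<name>...)), the groups have to be numbered instead, so the regex becomes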

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.

User answered 2/9, 2016 at 5:37 Comment(0)
R
0
//USING REGEX
/**
 * Parse URL to get information
 *
 * @param   url     the URL string to parse
 * @return  parsed  the URL parsed or null
 */
var UrlParser = function (url) {
    "use strict";

    var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
        matches = regx.exec(url),
        parser = null;

    if (null !== matches) {
        parser = {
            href              : matches[0],
            withoutHash       : matches[1],
            url               : matches[2],
            origin            : matches[3],
            protocol          : matches[4],
            protocolseparator : matches[5],
            credhost          : matches[6],
            cred              : matches[7],
            user              : matches[8],
            pass              : matches[9],
            host              : matches[10],
            hostname          : matches[11],
            port              : matches[12],
            pathname          : matches[13],
            segment1          : matches[14],
            segment2          : matches[15],
            search            : matches[16],
            hash              : matches[17]
        };
    }

    return parser;
};

var parsedURL=UrlParser(url);
console.log(parsedURL);
Reuter answered 16/8, 2017 at 8:28 Comment(0)
M
0

I tried this regex for parsing url partitions:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*))(\?([^#]*))?(#(.*))?$

URL: https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2

Matches:

Group 1.    0-7 https:/
Group 2.    0-5 https
Group 3.    8-22    www.google.com
Group 6.    22-50   /my/path/sample/asd-dsa/this
Group 7.    22-46   /my/path/sample/asd-dsa/
Group 8.    46-50   this
Group 9.    50-74   ?key1=value1&key2=value2
Group 10.   51-74   key1=value1&key2=value2
Microclimate answered 22/7, 2020 at 7:25 Comment(0)
T
0

The best answer suggested here didn't work for me because my URLs also contain a port. However modifying it to the following regex worked for me:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:\d+)?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
Tellus answered 30/11, 2020 at 8:29 Comment(0)
D
0

Both browsers and Node.js have a built-in URL class, which appears to share the same interface in both environments. Check the respective documentation for your case.

https://nodejs.org/api/url.html#urlhost

https://developer.mozilla.org/en-US/docs/Web/API/URL

This is how it may be used though.

let url = new URL('https://test.example.com/cats?name=foofy')
url.protocol; // https:
url.hostname; // test.example.com
url.pathname; // /cats
url.search; // ?name=foofy

let params = url.searchParams
let name = params.get('name'); // returns a string, or null if the parameter is absent, so parse accordingly

for more on parameters also see https://developer.mozilla.org/en-US/docs/Web/API/URL/searchParams
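For comparison, the same idea of reading query parameters through a library can be sketched in Python with `urllib.parse` (a swapped-in language, not the URL class described above). `parse_qs` maps each parameter name to a list of string values, since a name may appear more than once.

```python
from urllib.parse import urlsplit, parse_qs

url = urlsplit('https://test.example.com/cats?name=foofy')
params = parse_qs(url.query)
print(url.scheme, url.hostname, url.path, params)

# Values are always strings; take the first one and convert as needed.
name = params.get('name', [None])[0]
print(name)
```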

Derm answered 12/12, 2021 at 18:21 Comment(0)
M
-2
String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));

Will provide the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888"; the output will be the following :
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888

enjoy..
Yosi Lev

Margie answered 24/12, 2015 at 10:55 Comment(1)
Doesn't handle ports. Isn't language agnostic.Actinic
