Regular expression to remove hostname and port from URL?

I need to write some JavaScript to strip the hostname:port part from a URL, meaning I want to extract the path part only.

For example, I want to write a function getPath(url) such that getPath("http://host:8081/path/to/something") returns "/path/to/something".

Can this be done using regular expressions?

Araarab answered 14/1, 2009 at 2:51 Comment(3)
This doesn't require regular expressions at all - see my answer :)Diversified
It's not that it doesn't require regular expressions. This shouldn't be done using regular expressions.Replenish
But it's still useful to know.Orlando
B
13

Quick 'n' dirty:

^[^#]*?://.*?(/.*)$

Everything after the hostname and port (including the initial /) is captured in the first group.
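
As a rough illustration of how this pattern might be wrapped in the getPath function from the question, here is a minimal sketch. The null return for non-matching input is an assumption made for this example, not part of the answer.

```javascript
// Sketch only: wraps the quick-and-dirty pattern above in a function.
// The "/" characters are escaped because this is a regex literal.
function getPath(url) {
    var match = /^[^#]*?:\/\/.*?(\/.*)$/.exec(url);
    // exec returns null when the pattern does not match, for example
    // when the URL has no "://" or nothing follows the host.
    // Returning null in that case is a choice made for this sketch.
    return match ? match[1] : null;
}

// Note that group 1 also carries any query string and fragment.
console.log(getPath("http://host:8081/path/to/something"));
```

Because the capture runs to the end of the string, a URL such as "http://host/a?b=1#c" yields "/a?b=1#c" rather than just "/a".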

Banyan answered 14/1, 2009 at 2:58 Comment(6)
Or in regular expression literal form ("/" needs to be escaped): /^.*?:\/\/.*?(\/.*)$/.exec("http://example.com/folder/file.ext")[1] gives "/folder/file.ext"Sapient
This regex is wrong. It captures the path, query and fragment in group 1.Addis
Regex isn't necessary at all! Nice though!Diversified
@mikesamuel, The question asked to remove the hostname and port. I'll correct my answer to have a suitable explanation, though.Banyan
@strager, doesn't this still convert some URLs that have no scheme or authority portions into ones that do? For example, #foo://bar//example.com/ has no scheme or authority, but your regex will change it into a protocol-relative URL that has an authority, //example.com/.Addis
@Mike Samuel, That is very true. As I said, it's quick and dirty, and by no means a robust solution. You can work around the issue by using [^#]*? instead of .*? for the protocol. I'll update my answer to reflect this.Banyan
A
29

RFC 3986 ( http://www.ietf.org/rfc/rfc3986.txt ) says in Appendix B

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

  http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

  $1 = http:
  $2 = http
  $3 = //www.ics.uci.edu
  $4 = www.ics.uci.edu
  $5 = /pub/ietf/uri/
  $6 = <undefined>
  $7 = <undefined>
  $8 = #Related
  $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the five components as

  scheme    = $2
  authority = $4
  path      = $5
  query     = $7
  fragment  = $9
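
The mapping above translates fairly directly into JavaScript. The following is a minimal sketch; the names URI_PATTERN, parseUri and getPath are chosen for this example and do not come from the RFC.

```javascript
// The Appendix B expression as a JavaScript regex literal ("/" escaped).
var URI_PATTERN = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: returns the five components named in the RFC.
// Components that are absent come back as undefined.
function parseUri(uri) {
    // Every part of the pattern is optional, so exec matches any string.
    var m = URI_PATTERN.exec(uri);
    return {
        scheme: m[2],
        authority: m[4],
        path: m[5],
        query: m[7],
        fragment: m[9]
    };
}

// The function the question asks for, built on group $5.
function getPath(url) {
    return parseUri(url).path;
}

console.log(getPath("http://host:8081/path/to/something"));
```

Here the port stays inside the authority component ($4), and the path ($5) stops before any "?" or "#", so the query and fragment are not included.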
Addis answered 14/1, 2009 at 5:19 Comment(2)
The regex is mistakenly surrounded with ** and **.Crunch
A thorough reply and one I found useful—though not as direct as the accepted answer. Thanks.Hostetler
D
14

I know regular expressions are useful, but they're not necessary in this situation. Every link (anchor) element in the DOM exposes the same URL properties as the Location object, including pathname.

So, to access that property for an arbitrary URL, you need to create a new anchor element, set its href, and then return its pathname.

An example, which works wherever a DOM is available:

function getPath(url) {
    var a = document.createElement('a');
    a.href = url;
    return a.pathname.substr(0,1) === '/' ? a.pathname : '/' + a.pathname;
}

jQuery version: (uses regex to add leading slash if needed)

function getPath(url) {
    // Capture the first character so it is kept when the slash is prepended
    return $('<a/>').attr('href',url)[0].pathname.replace(/^([^\/])/,'/$1');
}
Diversified answered 14/1, 2009 at 8:59 Comment(2)
I know it's an old post, but I really like your method J-P :)Exciseman
Note that this will ONLY work if you have a DOM. In environments like node.js or web workers, there is no DOM. (Probably not a common condition in 2009 when this answer was written...)Knorring
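
For environments without a DOM, as the comment above points out, one possible alternative is the standard URL constructor, which is available in modern browsers, web workers and Node.js. This is a sketch of that swapped-in technique, not part of the original answer.

```javascript
// Sketch only: uses the WHATWG URL constructor instead of an <a> element.
// new URL(...) throws a TypeError for relative input such as "/just/a/path"
// unless a base URL is supplied as a second argument.
function getPath(url) {
    return new URL(url).pathname;
}

console.log(getPath("http://host:8081/path/to/something"));
```

As with the anchor element, pathname excludes the query string and fragment; those are exposed separately as search and hash.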
L
4

The window.location object has pathname, search and hash properties which contain what you require.

For this page, the values are:

location.pathname // '/questions/441755/regular-expression-to-remove-hostname-and-port-from-url'
location.search   // '' because there is no query string
location.hash     // '' because there is no fragment

So you could use:

var fullpath = location.pathname + location.search + location.hash;
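
The same concatenation can be written as a small function that accepts any object with these three properties. The name fullPath is made up for this sketch; in a browser you would pass window.location, and the example below uses a URL object so that it also runs outside a browser.

```javascript
// Sketch only: joins path, query string and fragment of a location-like object.
function fullPath(loc) {
    return loc.pathname + loc.search + loc.hash;
}

// URL objects expose the same pathname, search and hash properties.
var example = new URL("http://host:8081/path/to/something?x=1#top");
console.log(fullPath(example));
```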
Leprechaun answered 14/1, 2009 at 9:19 Comment(0)
C
2

It's very simple:

^\w+:.*?(:)\d*

This matches from the scheme (for example http or https) through the second ":" and the port digits that follow it, so the match covers the scheme, hostname and port.

It works for the two cases below:

http://localhost:8080/myapplication

https://localhost:8080/myapplication

Hope this helps.
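
To show how this pattern might be used to strip the matched part, here is a minimal sketch. The function name stripHostAndPort is made up for this example. It also illustrates a limitation: the pattern only matches when the URL contains an explicit port.

```javascript
// Sketch only: removes whatever the pattern above matches.
function stripHostAndPort(url) {
    return url.replace(/^\w+:.*?(:)\d*/, '');
}

console.log(stripHostAndPort("http://localhost:8080/myapplication"));
console.log(stripHostAndPort("https://localhost:8080/myapplication"));

// Without a second ":" the pattern does not match at all,
// so the URL comes back unchanged.
console.log(stripHostAndPort("http://localhost/myapplication"));
```

A URL with no port but a ":" later on, for example in the query string, would be cut at that later colon, so this approach is best kept to inputs known to have the scheme://host:port shape.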

Coquito answered 24/5, 2017 at 4:51 Comment(0)
G
1

This regular expression seems to work: (http://[^/]*)(/.*)

As a test I ran this search and replace in a text editor:

 Search: (http://[^/]*)(/.*)
Replace: Part #1: \1\nPart #2: \2  

It converted this text:

http://host:8081/path/to/something

into this:

Part #1: http://host:8081
Part #2: /path/to/something

and converted this:

http://stackoverflow.com/questions/441755/regular-expression-to-remove-hostname-and-port-from-url

into this:

Part #1: http://stackoverflow.com
Part #2: /questions/441755/regular-expression-to-remove-hostname-and-port-from-url
Grumous answered 14/1, 2009 at 3:2 Comment(0)
