What happens if <base href...> is set with a double slash?

I would like to understand how to handle a <base href="" /> value in my web crawler, so I tested several combinations in the major browsers and found a behavior with double slashes that I don't understand.

If you don't want to read everything, jump to the results of Tests D and E. A demonstration of all tests:
http://gutt.it/basehref.php

Here are my test results, step by step, when requesting http://example.com/images.html:

A - Multiple base href

<html>
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • only the first <base> with href counts
  • a source starting with / targets the root
  • ../ goes one folder up

B - Without trailing slash

<html>
<head>
<base href="http://example.com/images" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • <base href> ignores everything after the last slash, so http://example.com/images is treated as http://example.com/
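The effect of the missing trailing slash can be reproduced outside the browser. This is a minimal sketch, assuming the WHATWG URL constructor (new URL(relative, base), available in Node.js and current browsers) resolves the same way the browsers did in the tests; it is not part of the original test page.

```javascript
// Sketch only: compare a base URL with and without a trailing slash.
const withSlash = "http://example.com/images/";
const withoutSlash = "http://example.com/images";

// Resolving "." yields the "directory" that relative URLs are resolved against.
const dirWithSlash = new URL(".", withSlash).href;
const dirWithoutSlash = new URL(".", withoutSlash).href;
console.log(dirWithSlash);    // http://example.com/images/
console.log(dirWithoutSlash); // http://example.com/

// The same relative source therefore lands in two different places.
const imgWithSlash = new URL("image.jpg", withSlash).href;
const imgWithoutSlash = new URL("image.jpg", withoutSlash).href;
console.log(imgWithSlash);    // http://example.com/images/image.jpg
console.log(imgWithoutSlash); // http://example.com/image.jpg
```

Without the trailing slash, "images" is treated as the last path segment (a "file") and is dropped before the relative source is appended.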

C - How it should be

<html>
<head>
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • Same result as in Test B, as expected

D - Double Slash

<html>
<head>
<base href="http://example.com/images//" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

E - Double Slash with whitespace

<html>
<head>
<base href="http://example.com/images/ /" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

Neither is a "valid" URL, but both are real cases that my web crawler encountered. Please explain what happens in D and E such that ../image.jpg can be found, and why the whitespace makes a difference.

For your information:

  • <base href="http://example.com//" /> gives the same result as Test C
  • <base href="http://example.com/ /" /> is completely different: only ../image.jpg is found
  • <base href="a/" /> finds only /images/image.jpg
Variola answered 18/3, 2015 at 12:29 Comment(7)
I mean, you can simply try that. Why didn't you? – Triazine
@panther Of course I did. You did not read the description. Please remove your downvote and close request. – Variola
See https://mcmap.net/q/330045/-url-with-multiple-forward-slashes-does-it-break-anything. By definition you should only use single slashes... double slashes will be handled differently depending on the server setup. – Drippy
@Drippy If multiple slashes acted as a single slash (as described in your link), Tests D and E would give the same results as Test A. – Variola
@Drippy I know that my example is not "valid", but it is a real-life case my crawler ran into, so I want to know how to apply base href in these special cases. I have no influence over the HTML source. – Variola
“a source starting with / targets the root (ignores <base>)” – it doesn’t “ignore” it – a leading slash simply refers to the root, no matter what – by definition. This has nothing to do with whether a base is given or not. – Bolster
I've now added a demo page to the description as well. – Variola

The behavior of base is explained in the HTML spec:

The base element allows authors to specify the document base URL for the purposes of resolving relative URLs.

As shown in your Test A, if there are multiple base elements with an href attribute, the document base URL is taken from the first one.
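For a crawler, the "first base with href wins" rule could be approximated as follows. This is a rough sketch only: the function names firstBaseHref and documentBaseUrl are made up for illustration, the regular expressions are a simplification (a real HTML parser also handles comments, scripts and unusual attribute syntax), and new URL is assumed to be available (Node.js or a current browser).

```javascript
// Hypothetical helper: return the href of the first <base> tag that has one.
function firstBaseHref(html) {
  const tagPattern = /<base\b[^>]*>/gi;
  const hrefPattern = /\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i;
  for (const tag of html.match(tagPattern) || []) {
    const m = hrefPattern.exec(tag);
    // <base target="_blank" /> has no href and is skipped.
    if (m) return m[1] ?? m[2] ?? m[3];
  }
  return null;
}

// Hypothetical helper: the base href may itself be relative (e.g. "a/"),
// so it is resolved against the URL the document was fetched from.
function documentBaseUrl(html, documentUrl) {
  const href = firstBaseHref(html);
  if (href === null) return documentUrl;
  try {
    return new URL(href, documentUrl).href;
  } catch (e) {
    return documentUrl; // unparsable href: fall back to the document URL
  }
}

const testA = `<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>`;
const baseForTestA = documentBaseUrl(testA, "http://example.com/images.html");
console.log(baseForTestA); // http://example.com/images/
```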

Resolving relative URLs is done this way:

Apply the URL parser to url, with base as the base URL, with encoding as the encoding.

The URL parsing algorithm is defined in the URL spec.

It's too complex to be explained here in detail. But basically, this is what happens:

  • A relative URL starting with / is resolved against the base URL's scheme and host; the base URL's path is discarded.
  • Otherwise, the relative URL is resolved against the base URL's last directory.
  • Be aware that if the base path doesn't end with /, its last part is treated as a file, not a directory, and is dropped.
  • ./ is the current directory
  • ../ goes one directory up

(Strictly speaking, URLs have path segments rather than "directories" and "files", but the analogy is close enough here.)
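The rules above can be observed directly with the base from Test A. This sketch assumes the WHATWG URL constructor (new URL(relative, base)) rather than the anchor-element approach used further down; the helper name resolveAgainstTestA is only for illustration.

```javascript
// Sketch only: one relative URL per rule, resolved against the Test A base.
const testABase = "http://example.com/images/";

function resolveAgainstTestA(relative) {
  return new URL(relative, testABase).href;
}

// Leading slash: only scheme and host of the base are kept.
console.log(resolveAgainstTestA("/image.jpg"));   // http://example.com/image.jpg
// No leading slash: appended to the base's last directory.
console.log(resolveAgainstTestA("image.jpg"));    // http://example.com/images/image.jpg
// "./" stays in the current directory.
console.log(resolveAgainstTestA("./image.jpg"));  // http://example.com/images/image.jpg
// "../" removes one directory before appending.
console.log(resolveAgainstTestA("../image.jpg")); // http://example.com/image.jpg
```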

Some examples:

  • http://example.com/images/a/./ is http://example.com/images/a/
  • http://example.com/images/a/../ is http://example.com/images/
  • http://example.com/images//./ is http://example.com/images//
  • http://example.com/images//../ is http://example.com/images/
  • http://example.com/images/./ is http://example.com/images/
  • http://example.com/images/../ is http://example.com/
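These examples can be checked by letting a URL parser normalize them. A minimal sketch, again assuming the WHATWG URL constructor is available; the variable names are illustrative.

```javascript
// Sketch only: normalize the example URLs and print the result.
const exampleInputs = [
  "http://example.com/images/a/./",
  "http://example.com/images/a/../",
  "http://example.com/images//./",
  "http://example.com/images//../",
  "http://example.com/images/./",
  "http://example.com/images/../"
];

// new URL(...) removes "." and ".." segments but keeps empty segments ("//").
const normalized = exampleInputs.map(function (input) {
  return new URL(input).href;
});

exampleInputs.forEach(function (input, i) {
  console.log(input + " is " + normalized[i]);
});
```

The interesting pair is the double-slash one: the empty segment between the two slashes counts as a directory of its own, so ../ only removes that empty segment and ends up at images/.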

Note that, in most cases, the server will treat // like /, even though URL resolution itself keeps the empty segment. As @poncha said,

Unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the uri maps to a path on disk, but in (most?) modern operating systems (Linux/Unix, Windows), multiple path separators in a row do not have any special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.

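However, / / (slash, space, slash) will generally not be treated like //: the space is a path segment of its own (percent-encoded as %20), so the server looks for a directory named " ".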
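The difference between Test D and Test E can be made visible by resolving the same sources against both bases. This is a sketch under the assumption that the WHATWG URL constructor matches browser behavior; whether the resulting URL is "found" still depends on the server.

```javascript
// Sketch only: Test D (double slash) versus Test E (slash, space, slash).
const baseD = "http://example.com/images//";
const baseE = "http://example.com/images/ /";

// The space survives parsing as a percent-encoded path segment.
const parsedE = new URL(baseE).href;
console.log(parsedE); // http://example.com/images/%20/

const resultsDE = {};
for (const base of [baseD, baseE]) {
  for (const src of ["image.jpg", "../image.jpg"]) {
    resultsDE[base + " + " + src] = new URL(src, base).href;
  }
}
console.log(resultsDE);
// D + image.jpg    -> http://example.com/images//image.jpg   (server treats // like /)
// D + ../image.jpg -> http://example.com/images/image.jpg
// E + image.jpg    -> http://example.com/images/%20/image.jpg (no such directory)
// E + ../image.jpg -> http://example.com/images/image.jpg
```

In both tests ../ removes only the extra last segment (the empty one in D, the space in E), which is why ../image.jpg ends up in images/ and is found.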

You can use the following snippet to resolve your list of relative URLs to absolute ones:

var bases = [
  "http://example.com/images/",
  "http://example.com/images",
  "http://example.com/",
  "http://example.com/images//",
  "http://example.com/images/ /"
];
var urls = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"
];
function newEl(type, contents) {
  var el = document.createElement(type);
  if(!contents) return el;
  if(!(contents instanceof Array))
    contents = [contents];
  for(var i=0; i<contents.length; ++i)
    if(typeof contents[i] == 'string')
      el.appendChild(document.createTextNode(contents[i]))
    else if(typeof contents[i] == 'object') // contents[i] instanceof Node
      el.appendChild(contents[i])
  return el;
}
function emoticon(str) {
  return {
    'http://example.com/images/image.jpg': 'good',
    'http://example.com/images//image.jpg': 'neutral'
  }[str] || 'bad';
}
var base = document.createElement('base'),
    a = document.createElement('a'),
    output = document.createElement('ul'),
    head = document.getElementsByTagName('head')[0];
head.insertBefore(base, head.firstChild);
for(var i=0; i<bases.length; ++i) {
  base.href = bases[i];
  var test = newEl('li', [
    'Test ' + (i+1) + ': ',
    newEl('span', bases[i])
  ]);
  test.className = 'test';
  var testItems = newEl('ul');
  testItems.className = 'test-items';
  for(var j=0; j<urls.length; ++j) {
    a.href = urls[j];
    var absURL = a.cloneNode(false).href;
      /* Stupid old IE requires cloning
         https://mcmap.net/q/218590/-getting-an-absolute-url-from-a-relative-one-ie6-issue */
    var testItem = newEl('li', [
      newEl('span', urls[j]),
      ' → ',
      newEl('span', absURL)
    ]);
    testItem.className = 'test-item ' + emoticon(absURL);
    testItems.appendChild(testItem);
  }
  test.appendChild(testItems);
  output.appendChild(test);
}
document.body.appendChild(output);

/* CSS for the test output above */
span {
  background: #eef;
}
.test-items {
  display: table;
  border-spacing: .13em;
  padding-left: 1.1em;
  margin-bottom: .3em;
}
.test-item {
  display: table-row;
  position: relative;
  list-style: none;
}
.test-item > span {
  display: table-cell;
}
.test-item:before {
  display: inline-block;
  width: 1.1em;
  height: 1.1em;
  line-height: 1em;
  text-align: center;
  border-radius: 50%;
  margin-right: .4em;
  position: absolute;
  left: -1.1em;
  top: 0;
}
.good:before {
  content: ':)';
  background: #0f0;
}
.neutral:before {
  content: ':|';
  background: #ff0;
}
.bad:before {
  content: ':(';
  background: #f00;
}

You can also play with this snippet:

var resolveURL = (function() {
  var base = document.createElement('base'),
      a = document.createElement('a'),
      head = document.getElementsByTagName('head')[0];
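  return function(url, baseurl) {
    // Temporarily insert a <base> so the anchor resolves against baseurl.
    if(baseurl) {
      base.href = baseurl;
      head.insertBefore(base, head.firstChild);
    }
    a.href = url;
    /* Old IE only returns the absolute URL from a cloned anchor
       https://mcmap.net/q/218590/-getting-an-absolute-url-from-a-relative-one-ie6-issue */
    var abs = a.cloneNode(false).href;
    // Remove the temporary <base> again so the page itself is unaffected.
    if(baseurl)
      head.removeChild(base);
    return abs;
  };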
})();
var base = document.getElementById('base'),
    url = document.getElementById('url'),
    abs = document.getElementById('absolute');
base.onpropertychange = url.onpropertychange = function() {
  if (event.propertyName == "value")
    update()
};
(base.oninput = url.oninput = update)();
function update() {
  abs.value = resolveURL(url.value, base.value);
}

/* CSS for the resolver form */
label {
  display: block;
  margin: 1em 0;
}
input {
  width: 100%;
}

<!-- HTML for the resolver form -->
<label>
  Base url:
  <input id="base" value="http://example.com/images//foo////bar/baz"
         placeholder="Enter your base url here" />
</label>
<label>
  URL to be resolved:
  <input id="url" value="./a/b/../c"
         placeholder="Enter your URL here">
</label>
<label>
  Resulting url:
  <input id="absolute" readonly>
</label>
Biometry answered 18/3, 2015 at 21:1 Comment(1)
Thank you! Now I understand what happens. In Test D, // is one level deeper than images/, so ../ targets images/. But if the resolved URL contains //, as it does with ./image.jpg, the server ignores the second slash. To clear things up I will edit your answer a little bit. Feel free to use it. – Variola
