I would like to understand how a <base href="" /> value is resolved so that my web crawler can handle it correctly. I tested several combinations in the major browsers and found a result with double slashes that I don't understand.
If you don't want to read everything, jump to the test results of D and E. Demonstration of all tests:
http://gutt.it/basehref.php
Step by step, my test results when calling http://example.com/images.html:
A - Multiple base href
<html>
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- only the first <base> with href counts
- a source starting with / targets the root
- ../ goes one folder up
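For the crawler side, the conclusions of Test A can be sketched in Python. This is only a minimal sketch using the standard library's html.parser and urllib.parse.urljoin; the class name BaseAndImages and the function resolve_images are hypothetical names made up for this illustration, and it assumes that only <img src> needs to be resolved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAndImages(HTMLParser):
    """Collects the first <base> that has an href, plus every <img src>."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only the first <base> carrying an href counts (Test A);
        # <base target="_blank" /> has no href and is skipped.
        if tag == "base" and self.base_href is None and "href" in attrs:
            self.base_href = attrs["href"] or ""
        elif tag == "img" and attrs.get("src"):
            self.sources.append(attrs["src"])


def resolve_images(html, page_url):
    """Return the absolute URL of every image in the document."""
    parser = BaseAndImages()
    parser.feed(html)
    base = page_url
    if parser.base_href is not None:
        # The base href may itself be relative to the page URL.
        base = urljoin(page_url, parser.base_href)
    return [urljoin(base, src) for src in parser.sources]


TEST_A = """
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="../image.jpg">
</body>
"""

for url in resolve_images(TEST_A, "http://example.com/images.html"):
    print(url)
```

With the Test A markup, the first two sources end up at http://example.com/images/image.jpg and ../image.jpg ends up at http://example.com/image.jpg, which matches the found / not found pattern above.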
B - Without trailing slash
<html>
<head>
<base href="http://example.com/images" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- <base href> ignores everything after the last slash, so http://example.com/images becomes http://example.com/
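The same behaviour can be reproduced with Python's urllib.parse.urljoin, which also drops the last path segment of the base when it does not end in a slash. A short sketch; the variable names are only for illustration:

```python
from urllib.parse import urljoin

base_without_slash = "http://example.com/images"
base_with_slash = "http://example.com/images/"

for src in ["image.jpg", "./image.jpg", "images/image.jpg"]:
    # Without the trailing slash, "images" is treated like a file name
    # and is replaced by the relative reference.
    print(src, "->", urljoin(base_without_slash, src),
          "|", urljoin(base_with_slash, src))
```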
C - How it should be
<html>
<head>
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- Same result as in Test B as expected
D - Double Slash
<html>
<head>
<base href="http://example.com/images//" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>
E - Double Slash with whitespace
<html>
<head>
<base href="http://example.com/images/ /" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>
Neither is a "valid" URL, but both are real results from my web crawler. Please explain what happened in D and E such that ../image.jpg could be found, and why the whitespace makes a difference.
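One possible way to reason about D and E is to treat whatever sits between two slashes as a real path segment, even if it is empty or a single space, which is how RFC 3986 style reference resolution works. The sketch below is a hand-written resolver under that assumption; resolve_keep_empty is a hypothetical helper, not a browser API. It is written out by hand because urljoin in some Python versions filters empty segments and would hide the effect.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_keep_empty(base, ref):
    """Resolve ref against base, keeping empty path segments (assumption:
    this mirrors how browsers treat "//" and "/ /" in a base href)."""
    b = urlsplit(base)
    if ref.startswith("/"):
        # A leading slash always targets the root; the base path is unused.
        segments = ref.split("/")[1:]
    else:
        # Drop everything after the last slash of the base path,
        # then append the segments of the relative reference.
        segments = b.path.split("/")[1:-1] + ref.split("/")

    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == ".":
            if is_last:
                out.append("")
        elif seg == "..":
            if out:
                out.pop()  # removes one segment, even an empty one
            if is_last:
                out.append("")
        else:
            out.append(seg)

    # Simplification: only spaces are percent-encoded here.
    path = ("/" + "/".join(out)).replace(" ", "%20")
    return urlunsplit((b.scheme, b.netloc, path, "", ""))


for base in ["http://example.com/images//", "http://example.com/images/ /"]:
    for src in ["image.jpg", "../image.jpg"]:
        print(base, "+", src, "->", resolve_keep_empty(base, src))
```

Under this model, ../image.jpg removes only the empty (D) or space (E) segment and therefore lands on http://example.com/images/image.jpg in both tests. The plain image.jpg resolves to /images//image.jpg in D and /images/%20/image.jpg in E; if the server collapses duplicate slashes (an assumption about the server, not something the tests prove), that would explain why D finds the image while E looks in a directory named " " that does not exist.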
Only for your interest:
- <base href="http://example.com//" /> is the same as Test C
- <base href="http://example.com/ /" /> is completely different. Only ../image.jpg is found
- <base href="a/" /> finds only /images/image.jpg
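The last case is consistent with the base href itself being resolved against the page URL first. A small sketch with urllib.parse.urljoin, where effective_base is a hypothetical helper name and the page URL is the one used throughout the tests:

```python
from urllib.parse import urljoin


def effective_base(page_url, base_href):
    """A relative base href is resolved against the page URL first."""
    return urljoin(page_url, base_href)


base = effective_base("http://example.com/images.html", "a/")
print(base)

for src in ["/images/image.jpg", "image.jpg", "images/image.jpg"]:
    # Only the root-relative source escapes the non-existent /a/ folder.
    print(src, "->", urljoin(base, src))
```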
<base>)” – it doesn’t “ignore” it – a leading slash simply refers to the root, no matter what – by definition. This has nothing to do with whether a base is given or not. – Bolster