How to fetch the base URL from a given URL using Java

2

13

I am trying to fetch the base URL of a page using Java. I use the JTidy parser to fetch the page title, and that works, but I cannot work out how to get the base URL from the given URL.

I have some URLs as input:

String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";

From the first string, I want to fetch "http://staff.unak.is/andy/GameProgramming0910/" as a base URL and from the second string, I want "http://www.complex.com/" as a base URL.

I am using this code:

URL url = new URL(s1);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
InputStream in = conn.getInputStream();
Document doc = new Tidy().parseDOM(in, null);
String titleText = doc.getElementsByTagName("title").item(0).getFirstChild()
        .getNodeValue();

I get titleText correctly, but how can I get the base URL from the URLs given above?

Vacillating answered 16/5, 2011 at 5:49 Comment(1)
What rules would tell you that http://www.complex.com/ is the base URL and not http://www.complex.com/pop-culture/2011/04/? – Hobbie
26

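Try the java.net.URL class; it will help you.

The second case is the easier one: you could use new URL(s2).getHost(). Note that this returns only the host name (www.complex.com), so you still have to prepend the protocol and "://" and append "/" to get the base URL.

For the first case, you could get the host, call the getFile() method as well, and remove everything after the last slash ("/"). Something like this (code not tested):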

URL url = new URL(s1);
// Keep everything up to and including the last slash, so the base ends with "/".
String path = url.getFile().substring(0, url.getFile().lastIndexOf('/') + 1);
String base = url.getProtocol() + "://" + url.getHost() + path;
Mottle answered 16/5, 2011 at 8:53 Comment(6)
I voted up, but it seems to me the third statement should be: String base = url.getProtocol() + "://" + url.getHost() + path; – Undersized
I think that URL getProtocol() returns the "://", but I haven't tested it :( – Mottle
@Mottle At least in Java 6, it doesn't, so you must add it yourself. The "://" is not part of the protocol name. – Nuggar
The URL string must include a protocol; otherwise a MalformedURLException is thrown. – Loath
It looks like, when the port differs from the default, it's better to use url.getAuthority() rather than getHost(). Info: docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html – Atthia
It is also better to use getPath() instead of getFile(). getFile() also returns the query part, and that could contain many slashes ... – Transpacific
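Putting the suggestions from these comments together, a minimal sketch might look like the following. The class name BaseUrlFromParts, the method names directoryBase and rootBase, and the example.com URL are made up for illustration; rethrowing MalformedURLException as IllegalArgumentException is just one possible design choice, not something the answer prescribes.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class; the names are illustrative, not from the answer.
class BaseUrlFromParts {

    // Parses the string, turning the checked exception into an unchecked one.
    // new URL(String) throws MalformedURLException when no protocol is present.
    private static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not an absolute URL: " + spec, e);
        }
    }

    // Base URL up to and including the last slash of the path ("directory" base).
    static String directoryBase(String spec) {
        URL url = parse(spec);
        // getPath() excludes the query string, unlike getFile().
        String path = url.getPath();
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        if (dir.isEmpty()) {
            dir = "/"; // URL without any path, e.g. http://host
        }
        // getAuthority() keeps a non-default port, unlike getHost().
        return url.getProtocol() + "://" + url.getAuthority() + dir;
    }

    // Base URL consisting of protocol, authority and the root path only.
    static String rootBase(String spec) {
        URL url = parse(spec);
        return url.getProtocol() + "://" + url.getAuthority() + "/";
    }

    public static void main(String[] args) {
        String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
        String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";
        System.out.println(directoryBase(s1));
        System.out.println(rootBase(s2));
        // Assumed example with a port and a query that contains slashes.
        System.out.println(directoryBase("http://example.com:8080/a/b.html?next=/x/y"));
    }
}
```

Which of the two methods gives "the" base URL depends on the rule you want to apply, as the comment under the question points out.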
11

You can use the java.net.URL class to resolve relative URLs.

For the first case, resolving "." against the URL removes the filename from the path:

new URL(new URL(s1), ".").toString()

For the second case, resolving "/" against the URL sets the path to the root:

new URL(new URL(s2), "/").toString()
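As a runnable sketch of the two expressions above: the class name RelativeBaseUrls and the method name resolve are made up for illustration, and the checked MalformedURLException is wrapped only to keep the example short. Note that the URL constructors are deprecated since Java 20, although they still work.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; the names are illustrative, not from the answer.
class RelativeBaseUrls {

    // Resolves a relative reference ("." or "/") against an absolute URL string.
    static String resolve(String base, String relative) {
        try {
            return new URL(new URL(base), relative).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve against: " + base, e);
        }
    }

    public static void main(String[] args) {
        String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
        String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";

        // "." keeps the directory part of the path and drops the last segment.
        System.out.println(resolve(s1, "."));
        // "/" replaces the whole path with the root.
        System.out.println(resolve(s2, "/"));
    }
}
```

The same resolve call with "." on s2 would keep the /pop-culture/2011/04/ directory, so the choice between "." and "/" encodes which rule defines the base URL.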
Clarendon answered 5/11, 2017 at 11:49 Comment(0)
