How to get a web page's source code from Java [duplicate]

Asked 23/12, 2011 at 13:43 Answered 4/12, 2013 at 16:29

I just want to retrieve any web page's source code from Java. I have found many solutions so far, but none that works for all the links below:

The main problem is that some of the code I tried does retrieve the page source, but with parts of it missing. For example, the code below does not work for the first link.

InputStream is = fURL.openStream(); // fURL can be one of the links above
BufferedReader buffer = new BufferedReader(new InputStreamReader(is, "iso-8859-9"));
StringBuilder builder = new StringBuilder();

int charRead; // read() returns one decoded character, not a byte
while ((charRead = buffer.read()) != -1) {
    builder.append((char) charRead);
}
buffer.close();
System.out.println(builder.toString());

Frick answered 23/12, 2011 at 13:43 Comment(8)

Note that you'll only get the source that is initially delivered when opening a URL. Additional content might be loaded via AJAX, and you won't see it if you just read the initial stream. As an example, open demo.vaadin.com/sampler in Firefox and then view the page source. You won't see the source for all the displayed content there. – Maryn 23/12, 2011 at 13:51

@cerq: Depending on your definition of "web page's source code", you can or you cannot do it. For example, it can be argued that the "source code" of, say, a web page generated by a .jsp is the .jsp file itself and not the generated HTML... What you're after is the HTML, not the "source code". In many cases the "source code" is on the server, and short of pirating the server you simply cannot access it. – Seamy 23/12, 2011 at 13:53

@Maryn I think my problem is what you describe. So is there any way to get the source of all the displayed content? – Frick 23/12, 2011 at 15:26

Well, you'd have to execute the JavaScript. Have a look at ScriptEngineManager. – Maryn 23/12, 2011 at 19:52

I happen to be asking the exact same question. If you have found the answer, please post it here. Thanks! – Cruces 3/6, 2014 at 18:55

Perhaps a duplicate of: How do you Programmatically Download a Webpage in Java. – Amii 15/6, 2014 at 1:29

People looking for a solution to this kind of problem can try the code below: – Reckoner 25/2, 2020 at 21:56

URL pageURL = new URL("https://researchgate.net/"); // scheme is assumed here; java.net.URL rejects a URL without one
BufferedReader in = new BufferedReader(new InputStreamReader(pageURL.openStream()));
String fileName = "C:\\Users\\Ali\\Desktop\\test.html";
PrintWriter writer = new PrintWriter(fileName, "UTF-8");
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
    writer.println(inputLine);
}
in.close();
writer.close(); // flush the buffered output to the file
– Reckoner 25/2, 2020 at 21:57

Try the following code with an added request property:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class SocketConnection
{
    public static String getURLSource(String url) throws IOException
    {
        URL urlObject = new URL(url);
        URLConnection urlConnection = urlObject.openConnection();
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

        return toString(urlConnection.getInputStream());
    }

    private static String toString(InputStream inputStream) throws IOException
    {
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
        {
            String inputLine;
            StringBuilder stringBuilder = new StringBuilder();
            while ((inputLine = bufferedReader.readLine()) != null)
            {
                // readLine() strips the line terminator, so put it back
                stringBuilder.append(inputLine).append('\n');
            }

            return stringBuilder.toString();
        }
    }
}

Mosby answered 23/12, 2011 at 13:46 Comment(4)

Neither your code nor the code I wrote works for the link cumhuriyet.com.tr?hn=298710. Please test your code first. – Frick 23/12, 2011 at 14:23

System.out.println(getURLSource("cumhuriyet.com.tr/?hn=298710")); works for me. – Mosby 23/12, 2011 at 14:48

It's still working perfectly – Fults 22/6, 2018 at 15:51

Giving no output for https://community.diabetes.org/discuss – Athwartships 31/1, 2019 at 10:59
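Both the question's snippet and the answer above hard-code a charset (iso-8859-9 and UTF-8 respectively), which may be one reason text comes back garbled on some sites. A minimal sketch of picking the charset from the response's Content-Type header instead is shown below. The class name `CharsetSniffer` and the method `charsetOf` are hypothetical names made up for this illustration, and the sketch ignores charsets declared only inside the HTML itself (for example in a meta tag).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of any answer in this thread.
class CharsetSniffer {

    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=ISO-8859-9". Falls back to the given charset when
     * the header is missing, has no charset parameter, or names a charset
     * this JVM does not know.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            // Locale.ROOT keeps the comparison independent of the default locale.
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With a real connection, the header value would come from `urlConnection.getContentType()`, and the returned charset would be passed to the `InputStreamReader` constructor in place of the fixed "UTF-8".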

URL yahoo = new URL("http://www.yahoo.com/");
BufferedReader in = new BufferedReader(
            new InputStreamReader(
            yahoo.openStream()));

String inputLine;

while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);

in.close();

Mcmann answered 23/12, 2011 at 13:51 Comment(1)

I don't want code that works only for yahoo.com or google.com. Please read my post again. – Frick 23/12, 2011 at 14:24

I am sure that you have found a solution somewhere over the past 2 years, but the following is a solution that works for your requested site:

package javasandbox;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

/**
*
* @author Ryan.Oglesby
*/
public class JavaSandbox {

    private static String sURL;

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws MalformedURLException, IOException {
        sURL = "http://www.cumhuriyet.com.tr/?hn=298710";
        System.out.println(sURL);
        URL url = new URL(sURL);
        HttpURLConnection httpCon = (HttpURLConnection) url.openConnection();
        // set HTTP request headers
        httpCon.addRequestProperty("Host", "www.cumhuriyet.com.tr");
        httpCon.addRequestProperty("Connection", "keep-alive");
        httpCon.addRequestProperty("Cache-Control", "max-age=0");
        httpCon.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        httpCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
        // Advertising gzip means the server may compress the body; it must then be unzipped before decoding
        httpCon.addRequestProperty("Accept-Encoding", "gzip,deflate,sdch");
        httpCon.addRequestProperty("Accept-Language", "en-US,en;q=0.8");
        //httpCon.addRequestProperty("Cookie", "JSESSIONID=EC0F373FCC023CD3B8B9C1E2E2F7606C; lang=tr; __utma=169322547.1217782332.1386173665.1386173665.1386173665.1; __utmb=169322547.1.10.1386173665; __utmc=169322547; __utmz=169322547.1386173665.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/8616781/how-to-get-a-web-pages-source-code-from-java; __gads=ID=3ab4e50d8713e391:T=1386173664:S=ALNI_Mb8N_wW0xS_wRa68vhR0gTRl8MwFA; scrElm=body");
        HttpURLConnection.setFollowRedirects(false);
        httpCon.setInstanceFollowRedirects(false);
        httpCon.setDoOutput(true);
        httpCon.setUseCaches(true);

        httpCon.setRequestMethod("GET");

        BufferedReader in = new BufferedReader(new InputStreamReader(httpCon.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder a = new StringBuilder();
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine).append('\n'); // readLine() drops the line terminator, so put it back
        in.close();

        System.out.println(a.toString());

        httpCon.disconnect();
    }
}

Declarative answered 4/12, 2013 at 16:29 Comment(2)

Help is never too late. But I tried your code and it doesn't work for many web pages. – Cruces 3/6, 2014 at 18:51

I agree that this segment won't work against all web pages, as different pages return the data in different formats, and in some cases following redirects may be required for what you want to accomplish. In some cases you may receive a gzip-compressed response, which you could handle as follows:

InputStream gzippedResponse = httpCon.getInputStream();
InputStream ungzippedResponse = new GZIPInputStream(gzippedResponse);
InputStreamReader reader = new InputStreamReader(ungzippedResponse, "UTF-8");
StringWriter writer = new StringWriter();

– Declarative 29/5, 2015 at 18:41
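The gzip handling described in the comment above can be sketched end to end as follows. This is only an illustration under stated assumptions: the class `GzipAwareReader` and its methods `readBody` and `gzip` are hypothetical names, the body is assumed to be UTF-8 text, and only the "gzip" content encoding is handled (not deflate or sdch). The `main` method runs offline against an in-memory gzip stream rather than a real site.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not the answer author's code.
class GzipAwareReader {

    /**
     * Reads the whole body as UTF-8 text. When contentEncoding is "gzip"
     * (case-insensitive) the stream is unzipped first; otherwise it is read as-is.
     */
    static String readBody(InputStream raw, String contentEncoding) throws IOException {
        InputStream body = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            // Copy in chunks so line terminators are preserved exactly.
            char[] buffer = new char[8192];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    /** Demo helper: gzip-compresses a string so the example needs no network. */
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html>\n<body>merhaba</body>\n</html>\n";
        System.out.println(readBody(new ByteArrayInputStream(gzip(html)), "gzip"));
    }
}
```

Against a live connection, the two arguments would be `httpCon.getInputStream()` and `httpCon.getContentEncoding()`, so the body is unzipped only when the server actually reports a gzip encoding.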
