How to get a web page's source code from Java [duplicate]
I just want to retrieve any web page's source code from Java. I found lots of solutions so far, but I couldn't find any code that works for all the links below:

My main problem is that some snippets do retrieve the page source, but with parts of it missing. For example, the code below does not work for the first link.

// fURL is a java.net.URL pointing at one of the links above
InputStream is = fURL.openStream();
BufferedReader buffer = new BufferedReader(new InputStreamReader(is, "iso-8859-9"));
StringBuilder builder = new StringBuilder();

int charRead; // read() returns one character, or -1 at end of stream
while ((charRead = buffer.read()) != -1) {
    builder.append((char) charRead);
}
buffer.close();
System.out.println(builder.toString());
Frick answered 23/12, 2011 at 13:43 Comment(8)
Note that you'll only get the source that is initially delivered when opening a URL. Additional content may be loaded via AJAX, and you won't see it if you just read the initial stream. As an example, open demo.vaadin.com/sampler in Firefox and view the page source: you won't see the source for all the displayed content there. - Maryn
@cerq: Whether you can do it depends on your definition of "web page's source code". It can be argued, for example, that the "source code" of a page generated by a .jsp is the .jsp file itself and not the generated HTML. What you're after is the HTML, not the "source code". In many cases the "source code" is on the server, and short of breaking into the server you simply cannot access it. - Seamy
@Maryn I think my problem is what you describe. So is there any way to get the source of all the displayed content? - Frick
Well, you'd have to execute the JavaScript. Have a look at ScriptEngineManager. - Maryn
I happen to be asking the exact same question. If you have found the answer, please post it here. Thanks! - Cruces
Perhaps a duplicate of: How do you Programmatically Download a Webpage in Java. - Amii
People looking for a solution to this kind of problem can try the code below: - Reckoner
URL pageURL = new URL("researchgate.net/"); BufferedReader in = new BufferedReader(new InputStreamReader(pageURL.openStream())); String fileName = "C:\\Users\\Ali\\Desktop\\test.html"; PrintWriter writer = new PrintWriter(fileName, "UTF-8"); String inputLine; while ((inputLine = in.readLine()) != null) { System.out.println(inputLine); writer.println(inputLine); } in.close(); writer.close(); - Reckoner
26

Try the following code with an added request property:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class SocketConnection
{
    public static String getURLSource(String url) throws IOException
    {
        URL urlObject = new URL(url);
        URLConnection urlConnection = urlObject.openConnection();
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

        return toString(urlConnection.getInputStream());
    }

    private static String toString(InputStream inputStream) throws IOException
    {
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
        {
            String inputLine;
            StringBuilder stringBuilder = new StringBuilder();
            while ((inputLine = bufferedReader.readLine()) != null)
            {
                // readLine() strips the line terminator, so add it back
                stringBuilder.append(inputLine).append('\n');
            }

            return stringBuilder.toString();
        }
    }
}
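Both the snippet in the question ("iso-8859-9") and the code above ("UTF-8") hardcode the character set, which may garble pages served in a different encoding. Below is a minimal sketch, not a definitive implementation, of picking the charset from the response's Content-Type header instead. The class name CharsetSniffer and the method charsetFromContentType are hypothetical names made up for this illustration; in real use the header value would presumably come from urlConnection.getContentType().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original answer.
public class CharsetSniffer {

    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=ISO-8859-9", or the fallback if none is usable.
     */
    public static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                // Strip optional quotes around the value.
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-9", StandardCharsets.UTF_8));
        System.out.println(charsetFromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

The returned Charset can be passed straight to the InputStreamReader constructor in place of the "UTF-8" literal. Pages that declare their encoding only in a meta tag are not covered by this sketch.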
Mosby answered 23/12, 2011 at 13:46 Comment(4)
Neither your code nor the code I wrote works for the link cumhuriyet.com.tr?hn=298710. Please test your code first. - Frick
System.out.println(getURLSource("cumhuriyet.com.tr/?hn=298710")); it's OK. - Mosby
It's still working perfectly. - Fults
It gives no output for https://community.diabetes.org/discuss - Athwartships
3
URL yahoo = new URL("http://www.yahoo.com/");
BufferedReader in = new BufferedReader(
            new InputStreamReader(
            yahoo.openStream()));

String inputLine;

while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);

in.close();
Mcmann answered 23/12, 2011 at 13:51 Comment(1)
I don't want code that works only for yahoo.com or google.com. Please read my post again. - Frick
2

I am sure you have found a solution somewhere over the past 2 years, but the following is a solution that works for your requested site:

package javasandbox;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

/**
*
* @author Ryan.Oglesby
*/
public class JavaSandbox {

private static String sURL;

/**
 * @param args the command line arguments
 */
public static void main(String[] args) throws MalformedURLException, IOException {
    sURL = "http://www.cumhuriyet.com.tr/?hn=298710";
    System.out.println(sURL);
    URL url = new URL(sURL);
    HttpURLConnection httpCon = (HttpURLConnection) url.openConnection();
    //set http request headers
            httpCon.addRequestProperty("Host", "www.cumhuriyet.com.tr");
            httpCon.addRequestProperty("Connection", "keep-alive");
            httpCon.addRequestProperty("Cache-Control", "max-age=0");
            httpCon.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            httpCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
            // Ask for an uncompressed body: the response is read below as plain UTF-8 text, without gunzipping
            httpCon.addRequestProperty("Accept-Encoding", "identity");
            httpCon.addRequestProperty("Accept-Language", "en-US,en;q=0.8");
            //httpCon.addRequestProperty("Cookie", "JSESSIONID=EC0F373FCC023CD3B8B9C1E2E2F7606C; lang=tr; __utma=169322547.1217782332.1386173665.1386173665.1386173665.1; __utmb=169322547.1.10.1386173665; __utmc=169322547; __utmz=169322547.1386173665.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/8616781/how-to-get-a-web-pages-source-code-from-java; __gads=ID=3ab4e50d8713e391:T=1386173664:S=ALNI_Mb8N_wW0xS_wRa68vhR0gTRl8MwFA; scrElm=body");
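            // Disable redirects for this connection only; the static
            // HttpURLConnection.setFollowRedirects(false) would change the default for every connection
            httpCon.setInstanceFollowRedirects(false);
            httpCon.setUseCaches(true);

            // A GET request has no body, so setDoOutput(true) is not needed
            httpCon.setRequestMethod("GET");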

            BufferedReader in = new BufferedReader(new InputStreamReader(httpCon.getInputStream(), "UTF-8"));
            String inputLine;
            StringBuilder a = new StringBuilder();
            while ((inputLine = in.readLine()) != null)
                a.append(inputLine).append('\n'); // readLine() strips the line terminator
            in.close();

            System.out.println(a.toString());

            httpCon.disconnect();
}
}
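The code above turns redirect following off, so a site that answers with a 3xx status would yield an empty body. As a rough sketch of handling that by hand, the helper below checks for a redirect status and resolves the Location header against the request URL. RedirectHelper, isRedirect and resolveLocation are hypothetical names for this illustration, and the example.com URLs are placeholders; with a real connection the inputs would presumably come from httpCon.getResponseCode() and httpCon.getHeaderField("Location").

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original answer.
public class RedirectHelper {

    /** True for the HTTP status codes that normally carry a Location header. */
    public static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /**
     * Resolves a Location header value, which may be absolute or relative,
     * against the URL that was originally requested.
     */
    public static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        // Placeholder values; a real caller would read them from the response.
        int status = 302;
        String requested = "http://www.example.com/news/?hn=1";
        String location = "/login";

        if (isRedirect(status)) {
            System.out.println("Next request: " + resolveLocation(requested, location));
        }
    }
}
```

A caller would open a new connection to the resolved URL and repeat, ideally with a cap on the number of hops. Note that even with automatic following enabled, HttpURLConnection does not follow a redirect that switches between http and https, which is one case where a manual loop like this can help.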
Declarative answered 4/12, 2013 at 16:29 Comment(2)
Help is never too late, but I tried your code and it doesn't work on many web pages. - Cruces
I agree that this segment won't work against all web pages, as different pages return the data in different formats, and in some cases following redirects may be required for what you want to accomplish. In some cases you may receive a gzip-compressed response, which you could handle as follows: InputStream gzippedResponse = httpCon.getInputStream(); InputStream ungzippedResponse = new GZIPInputStream(gzippedResponse); InputStreamReader reader = new InputStreamReader(ungzippedResponse, "UTF-8"); StringWriter writer = new StringWriter(); - Declarative
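The gzip fragment in the last comment is cut off, so here is a self-contained sketch of the same idea, offered as one possible reading rather than the commenter's actual code. BodyDecoder, decode and gzip are hypothetical names for this illustration. With a real connection the arguments would presumably be httpCon.getInputStream() and httpCon.getContentEncoding(); the example compresses a small string in memory so that it runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of the original answer.
public class BodyDecoder {

    /**
     * Reads a response body as text, gunzipping it first when the
     * Content-Encoding header value is "gzip". Line breaks are preserved.
     */
    public static String decode(InputStream body, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body;
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] chunk = new char[4096];
            int n;
            // Read in chunks rather than by line, so no terminators are lost.
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        }
        return text.toString();
    }

    /** Compresses text in memory; used only to build the demo input below. */
    public static byte[] gzip(String text, Charset charset) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(compressed)) {
            out.write(text.getBytes(charset));
        }
        return compressed.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = gzip("<html>\n<body>merhaba</body>\n</html>", StandardCharsets.UTF_8);
        String html = decode(new ByteArrayInputStream(wire), "gzip", StandardCharsets.UTF_8);
        System.out.println(html);
    }
}
```

Only gzip is handled here; a request that also advertises deflate or sdch in Accept-Encoding would need matching decoders, so it may be simpler to advertise gzip alone.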
