How to fetch HTML in Java

6

35

Without the use of any external library, what is the simplest way to fetch a website's HTML content into a String?

Smithers answered 28/8, 2008 at 1:20 Comment(1)
possible duplicate of #239047 – Slotter
47

I'm currently using this (imports included this time):

import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

String content = null;
try {
    URLConnection connection = new URL("http://www.google.com").openConnection();
    try (Scanner scanner = new Scanner(connection.getInputStream())) {
        // "\\Z" anchors at the end of the input, so it never matches inside
        // the HTML and next() returns the whole stream as one token
        scanner.useDelimiter("\\Z");
        content = scanner.next();
    }
} catch (Exception ex) {
    ex.printStackTrace();
}
System.out.println(content);

But not sure if there's a better way.
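For what it's worth, on Java 11 and later the built-in java.net.http.HttpClient does this without any delimiter tricks and still needs no external library. A minimal sketch (the URL is just an example, and a network connection is assumed):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Fetch {
    public static void main(String[] args) throws Exception {
        // Build a client; NORMAL redirect policy follows same-protocol redirects
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/"))
                .GET()
                .build();
        // BodyHandlers.ofString() decodes the body into a String for you
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(!response.body().isEmpty());
    }
}
```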

Smithers answered 28/8, 2008 at 1:21 Comment(4)
Why "\\Z"? Isn't it an EOF on Windows only? I am just guessing here.Assessor
Why do you use "\\Z"? What does it do? I tried without it, it didn't work.Waldo
@MaxHusiv I think it's because if you don't specify a delimiter, scanner.next() will just go through the whole HTML character by character, but if you use a delimiter which won't be found in the HTML, scanner.next() returns the whole thing.Lodge
What import statements do you need for that to work?Mease
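The delimiter behaviour discussed in the comments can be checked without any network at all: a delimiter that can never match inside the data ("\\A", the beginning-of-input anchor, is the other common choice) makes scanner.next() return the whole stream as one token. A self-contained sketch:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class DelimiterDemo {
    public static void main(String[] args) {
        String html = "<html><body>hello world</body></html>";
        // "\\A" matches only at the very start of the input, so it never
        // splits the stream: next() returns everything up to end-of-stream.
        try (Scanner scanner = new Scanner(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)),
                StandardCharsets.UTF_8.name())) {
            scanner.useDelimiter("\\A");
            String content = scanner.hasNext() ? scanner.next() : "";
            System.out.println(content.equals(html)); // prints true
        }
    }
}
```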
23

This has worked well for me:

import java.io.InputStream;
import java.net.URL;

URL url = new URL(theURL);
StringBuffer buffer = new StringBuffer();
try (InputStream is = url.openStream()) {
    int ptr;
    while ((ptr = is.read()) != -1) {
        // note: casting a raw byte to char only works for single-byte encodings
        buffer.append((char) ptr);
    }
}

Not sure as to whether the other solutions provided are any more efficient.
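One caveat with the byte-by-byte approach above, as the comments note: casting each byte to char silently mangles any non-ASCII content, and the stream is never closed. A charset-aware variant using BufferedReader, shown here reading from an in-memory stream so it is runnable as-is (for a real page you would pass url.openStream() instead):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    // Reads an entire stream into a String, decoding with the given charset
    // and closing the stream when done.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>caf\u00e9</p>"; // non-ASCII survives the decode
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in).equals(html)); // prints true
    }
}
```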

Milliemillieme answered 29/8, 2008 at 5:11 Comment(5)
Don't you need to include the following? import java.io.* import java.net.* – Dilks
Sure, but they're core Java so very simple. As for the actual code, the import statements are omitted for clarity. – Milliemillieme
After the while loop, you should display the buffer's contents too, or write a method that reads it! – Raseda
Be sure to close the InputStream. – Brancusi
Why have you named the variable ptr? – Shiah
2

I just left this post in your other thread, though what you have above might work as well. I don't think either would be any easier than the other. The Apache packages can be accessed by just using import org.apache.commons.httpclient.HttpClient at the top of your code.

Edit: Forgot the link ;)

Contraception answered 28/8, 2008 at 1:31 Comment(1)
Apparently you also have to install the JAR file :) – Dilks
2

Whilst not vanilla-Java, I'll offer up a simpler solution. Use Groovy ;-)

String siteContent = new URL("http://www.google.com").text
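Plain Java caught up with this one-liner in Java 9: InputStream.readAllBytes() plus a charset gives much the same effect. A sketch (the charset is assumed to be UTF-8 here; real code should take it from the Content-Type header, and the in-memory stream stands in for the real connection so the snippet runs as-is):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class OneLiner {
    public static void main(String[] args) throws IOException {
        // For a real page: InputStream in = new URL("http://www.google.com").openStream();
        InputStream in = new ByteArrayInputStream(
                "<title>hi</title>".getBytes(StandardCharsets.UTF_8));
        try (in) {
            // readAllBytes() (Java 9+) drains the stream; decode with an explicit charset
            String siteContent = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            System.out.println(siteContent); // prints <title>hi</title>
        }
    }
}
```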
Milliemillieme answered 5/3, 2013 at 9:16 Comment(0)
0
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

try {
    URL u = new URL("https://www.Samsung.com/in/");
    URLConnection urlconnect = u.openConnection();
    try (InputStream stream = urlconnect.getInputStream()) {
        int i;
        while ((i = stream.read()) != -1) {
            System.out.print((char) i);
        }
    }
} catch (Exception e) {
    System.out.println(e);
}
Incessant answered 3/7, 2023 at 6:49 Comment(0)
-4

It's not a library but a tool named curl, generally installed on most servers; on Ubuntu you can install it easily with

sudo apt install curl

Then fetch any HTML page and store it in a local file, for example:

curl https://www.facebook.com/ > fb.html

You will get the home page HTML. You can open it in your browser as well.

Seiber answered 14/7, 2018 at 10:57 Comment(1)
Squints eyes to show shock. This is a Java question. – Venose

© 2022 - 2024 — McMap. All rights reserved.