HtmlUnit - Convert an HtmlPage into HTML string?

6

8

I'm using HtmlUnit to generate the HTML for various pages, but right now the closest I can get to the raw HTML that the server returns is to convert the HtmlPage into an XML string.

This is somewhat annoying because the XML output is rendered by web browsers differently than the raw HTML would. Is there a way to convert an HtmlPage into raw HTML instead of XML?

Thanks!

Seleucia answered 27/6, 2011 at 18:19 Comment(0)

12

page.asXml() will return the HTML. page.asText() returns it rendered down to just text.

Manwell answered 30/6, 2011 at 16:39 Comment(1)

Just want to confirm this only returns text within text nodes and does not include the tags and their attributes. – Camelopardalis 12/11, 2012 at 3:8

6

I'm not 100% certain I understood the question correctly, but maybe this will address your issue:

page.getWebResponse().getContentAsString()

Bandanna answered 28/6, 2011 at 10:43 Comment(1)

getWebResponse() returns the original page, without the modifications made by scripts. So asXml() and asText() are better solutions for getting the final page. – Shapely 9/9, 2015 at 13:18

1

I think there is no direct way to get the final page as HTML. asXml() returns the result as XML, asText() returns the extracted text content.

The best you can do is to use asXml() and "transform" it to HTML:

htmlPage.asXml().replaceFirst("<\\?xml version=\"1.0\" encoding=\"(.+)\"\\?>", "<!DOCTYPE html>")

(Of course you can apply more transformations like converting <br/> to <br> - it depends on your requirements.)

Even the related Google documentation recommends this approach (although they don't apply any transformations):

// return the snapshot
out.println(page.asXml());
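
The "transform it to HTML" idea above can be sketched with plain string processing. This is only a rough illustration, not a real serializer: the class and method names (XmlToHtml, toHtml) are hypothetical, the list of void elements is a partial one chosen for the example, and regular expressions will not cope with every document.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit.
final class XmlToHtml {
    // Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECLARATION =
            Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // Matches self-closed void elements such as <br/> or <img src="a.png" />.
    // Assumption: only this partial list of void elements is handled.
    private static final Pattern SELF_CLOSED_VOID = Pattern.compile(
            "<(br|hr|img|input|meta|link)\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    static String toHtml(String xml) {
        // Swap the XML declaration for an HTML5 doctype; if there is no
        // declaration, the input is left without a doctype.
        String withDoctype = XML_DECLARATION.matcher(xml).replaceFirst("<!DOCTYPE html>\n");
        // Rewrite <br/> as <br>, keeping any attributes.
        return SELF_CLOSED_VOID.matcher(withDoctype).replaceAll("<$1$2>");
    }
}
```

With HtmlUnit this would be called as XmlToHtml.toHtml(htmlPage.asXml()). Whether further rewrites are needed depends on what the XML output of your pages actually contains.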

Shapely answered 9/9, 2015 at 13:31 Comment(0)

0

I don't know of a better answer than switching on the Page type: for XmlPage and SgmlPage, take the innerHTML of the HTML element and write out that element's attributes manually. It's neither elegant nor exact (the doctype is missing), but it works.

Page.getWebResponse().getContentAsString()

This is incorrect, as it returns the text form of the original response bytes, unrendered and with no JavaScript applied. If JavaScript executes and changes the page, this method will not see the changes.

page.asXml() will return the HTML. page.asText() returns it rendered down to just text.

Just want to confirm this only returns text within text nodes and does not include the tags and their attributes. If you want the complete HTML, this is not good enough.

Camelopardalis answered 12/11, 2012 at 3:11 Comment(0)

0

Maybe you want to go with something like this, instead of using the HtmlUnit framework's methods:

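import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Reads the raw response body directly, bypassing HtmlUnit (no JavaScript is executed).
// The method name is illustrative; UTF-8 is an assumption about the server's encoding.
static String fetchRawHtml(URL url) {
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.UTF_8))) {

        // StringBuilder avoids the repeated copying of String += in a loop
        StringBuilder htmlSource = new StringBuilder();
        String line;

        while ((line = br.readLine()) != null) {
            htmlSource.append(line).append("\n");
        }

        return htmlSource.toString();

    } catch (IOException e) {
        // Rethrow so the method cannot fall through without returning a value
        throw new UncheckedIOException(e);
    }
}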

Penitential answered 15/5, 2015 at 7:22 Comment(0)

0

Here is my solution that works for me:

ScriptResult scriptResult = htmlPage.executeJavaScript("document.documentElement.outerHTML;");
System.out.println(scriptResult.getJavaScriptResult().toString());

Samiel answered 28/10, 2018 at 18:47 Comment(0)
