In Java and HtmlUnit, how to wait for a resulting page to finish loading and download it as HTML?

HtmlUnit is an awesome Java library that allows you to programmatically fill out and submit web forms. I'm currently maintaining a pretty old system written in ASP, and instead of manually filling out one web form every month as I'm required to, I'm trying to automate the entire task because I keep forgetting about it. It's a form for retrieving data gathered within a month. Here's what I've coded so far:

WebClient client = new WebClient();
HtmlPage page = client.getPage("http://urlOfTheWebsite.com/search.aspx");

HtmlForm form = page.getFormByName("aspnetForm");       
HtmlSelect frMonth = form.getSelectByName("ctl00$cphContent$ddlStartMonth");
HtmlSelect frDay = form.getSelectByName("ctl00$cphContent$ddlStartDay");
HtmlSelect frYear = form.getSelectByName("ctl00$cphContent$ddlStartYear");
HtmlSelect toMonth = form.getSelectByName("ctl00$cphContent$ddlEndMonth");
HtmlSelect toDay = form.getSelectByName("ctl00$cphContent$ddlEndDay");
HtmlSelect toYear = form.getSelectByName("ctl00$cphContent$ddlEndYear");
HtmlCheckBoxInput games = form.getInputByName("ctl00$cphContent$chkListLottoGame$0");
HtmlSubmitInput submit = form.getInputByName("ctl00$cphContent$btnSearch");

frMonth.setSelectedAttribute("1", true);
frDay.setSelectedAttribute("1", true);
frYear.setSelectedAttribute("2012", true);
toMonth.setSelectedAttribute("1", true);
toDay.setSelectedAttribute("31", true);
toYear.setSelectedAttribute("2012", true);
games.setChecked(true);
submit.click();

After the click(), I'm supposed to wait for the very same web page to finish reloading because somewhere there is a table that displays the results of my search. Then, when the page is done loading, I need to download it as an HTML file (very much like "Save Page As..." in your favorite browser) because I will scrape out the data to compute their totals, and I've already done that using the Jsoup library.

My questions are:

1. How do I programmatically wait for the web page to finish loading in HtmlUnit?
2. How do I programmatically download the resulting web page as an HTML file?

I've looked into the HtmlUnit docs already and couldn't find a class that'll do what I need.

Heir answered 5/7, 2012 at 6:40 Comment(0)

How do I programmatically download the resulting web page as an HTML file?

Try asXml(). Something like:

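// click() returns the page that results from the submission; keep that reference
page = submit.click();
String htmlContent = page.asXml();
File htmlFile = new File("C:/index.html");
// PrintWriter has no (File, boolean) constructor; pass a charset name instead
PrintWriter pw = new PrintWriter(htmlFile, "UTF-8");
pw.print(htmlContent);
pw.close();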
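
As a complement to the snippet above, here is a minimal sketch of the file-writing step using only the JDK's java.nio.file API. It assumes you already have the markup as a String (for example from page.asXml()). The names PageSaver, save and load are hypothetical, and the temp-directory path only stands in for wherever you want the file to go (a path such as C:/index.html would not be meaningful on Ubuntu).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PageSaver {

    private PageSaver() {
    }

    /** Writes the markup to the target file as UTF-8, replacing any existing file. */
    static void save(String html, Path target) {
        try {
            Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a saved file back as a String, e.g. to hand it to an HTML parser. */
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder markup; in the answer above this string would come from page.asXml()
        String html = "<html><body><table><tr><td>42</td></tr></table></body></html>";
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "result.html");
        save(html, target);
        // Confirms that what was written can be read back unchanged
        System.out.println(load(target).equals(html));
    }
}
```

Writing the bytes with an explicit charset avoids depending on the platform default encoding, which matters if the saved file is parsed later on a different machine.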
Artemisia answered 5/7, 2012 at 6:45 Comment(3)
asXml() does work! Do you know anything about waiting for the page to reload though? I tried to make the thread sleep for 30 seconds after my call to click() and successfully wrote the result of asXml() to an HTML file, but while the <select> elements are properly modified, the results don't show in the table. I'm assuming this might be because I need to make a new HtmlPage reference to the resulting page (which is basically just the same page too), but how do I do that? - Heir
@matkiros There is no benefit to making the thread sleep, since click() returns immediately with a new instance of HtmlPage (or a subclass); i.e., you need to do page = submit.click(); or assign the result to a new reference. - Artemisia
You're right, I did the page = submit.click() thing, and it worked as I wanted it to. Thanks! - Heir

Try waiting for background JavaScript with one of these WebClient methods:

webClient.waitForBackgroundJavaScript(timeoutMillis) or

webClient.waitForBackgroundJavaScriptStartingBefore(delayMillis)

I think you need to specify the browser as well. By default it uses IE. You can find more information in this question: HTMLUnit doesn't wait for Javascript
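
If a fixed timeout turns out to be either too short or wastefully long, one alternative (not something this answer proposes, just a common pattern) is to poll for a condition with an upper time limit. The sketch below is plain JDK code; Waiter and waitUntil are hypothetical names, and the condition in main is only a stand-in for a real check such as "the results table is present in the page".

```java
import java.util.function.BooleanSupplier;

final class Waiter {

    private Waiter() {
    }

    /**
     * Polls the condition until it is true or the timeout elapses.
     * Returns true if the condition became true in time, false otherwise.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 200 ms after start
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println(ready);
    }
}
```

Compared with a single long sleep, this returns as soon as the condition holds and still gives up after the timeout, so the caller can decide what to do when the results never appear.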

Withe answered 5/7, 2012 at 6:55 Comment(4)
I used waitForBackgroundJavaScript() instead of forcing my thread to sleep. What do you mean by "mention the browser," though? As in, when instantiating the WebClient object? Also, I forgot to mention that I'm doing all of this on Ubuntu, so maybe it's Firefox? - Heir
Could be. But ideally this difference should not be there. - Withe
@matkiros I think he means you need to try changing the browser by passing BrowserVersion.FIREFOX_3_6, or any other available browser version, to the constructor of WebClient. - Artemisia
@Artemisia Right. Well, in any case, it doesn't look like I have to do that anymore. I was already able to get the page's source by leaving the WebClient constructor empty. - Heir

This example might help you. After you click, you need to wait for the page to load. Most of the time it's a dynamic page that uses JavaScript and so on. All the overridden methods are there so that you are not overwhelmed by console messages. You can implement whichever ones you want.

public static void main(String[] args) throws IOException {
    WebClient webClient = gethtmlUnitClient();
    final HtmlPage page = webClient.getPage("YOUR PAGE");
    // Give background JavaScript up to 60 seconds to finish
    webClient.waitForBackgroundJavaScript(60000);
    System.out.println(page);
}

public static WebClient gethtmlUnitClient() {
    // Turn off Commons Logging and java.util.logging output from HtmlUnit and HttpClient
    LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log",
            "org.apache.commons.logging.impl.NoOpLog");
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);

    WebClient webClient = new WebClient(BrowserVersion.CHROME);

    // The listeners below are intentionally empty so that their messages are swallowed
    webClient.setIncorrectnessListener(new IncorrectnessListener() {
        @Override
        public void notify(String arg0, Object arg1) {
        }
    });
    webClient.setCssErrorHandler(new ErrorHandler() {
        @Override
        public void warning(CSSParseException arg0) throws CSSException {
        }

        @Override
        public void fatalError(CSSParseException arg0) throws CSSException {
        }

        @Override
        public void error(CSSParseException arg0) throws CSSException {
        }
    });
    webClient.setJavaScriptErrorListener(new JavaScriptErrorListener() {
        @Override
        public void timeoutError(HtmlPage arg0, long arg1, long arg2) {
        }

        @Override
        public void scriptException(HtmlPage arg0, ScriptException arg1) {
        }

        @Override
        public void malformedScriptURL(HtmlPage arg0, String arg1, MalformedURLException arg2) {
        }

        @Override
        public void loadScriptError(HtmlPage arg0, URL arg1, Exception arg2) {
        }
    });
    webClient.setHTMLParserListener(new HTMLParserListener() {
        @Override
        public void warning(String arg0, URL arg1, String arg2, int arg3, int arg4, String arg5) {
        }

        @Override
        public void error(String arg0, URL arg1, String arg2, int arg3, int arg4, String arg5) {
        }
    });
    // Do not abort on JavaScript errors in the page
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    return webClient;
}
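
One detail about the java.util.logging calls above: the JDK documents that a Logger returned by getLogger may be garbage collected if no strong reference to it is kept, in which case a level set on it can be lost. Below is a minimal sketch of a variant that keeps such references. QuietLogging and silence are hypothetical names, and the logger names are the ones used in the answer; newer HtmlUnit releases use a different package name, so they would probably need adjusting there.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietLogging {

    // Strong references: java.util.logging only holds loggers weakly, so a level
    // set on an otherwise unreferenced logger may not survive garbage collection.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");
    private static final Logger HTTPCLIENT_LOGGER =
            Logger.getLogger("org.apache.commons.httpclient");

    private QuietLogging() {
    }

    /** Turns off java.util.logging output for both logger hierarchies. */
    static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
        HTTPCLIENT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Shows the level now configured on the HtmlUnit logger
        System.out.println(Logger.getLogger("com.gargoylesoftware.htmlunit").getLevel());
    }
}
```

Calling QuietLogging.silence() once before creating the WebClient would take the place of the two setLevel lines in the method above.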
Perpetual answered 26/8, 2015 at 16:2 Comment(0)