Extremely simple code not working in HtmlUnit

I'm working with HtmlUnit 2.9 (the stable version that was released this month). Do you have any idea why the following code is not working?

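import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;

public class Main {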

    public static void main(String[] args) {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setCssEnabled(true);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setThrowExceptionOnFailingStatusCode(false);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setRedirectEnabled(false);
        webClient.setAppletEnabled(false);
        webClient.setJavaScriptEnabled(false);
        webClient.setPopupBlockerEnabled(true);
        webClient.setTimeout(60000);
        webClient.setPrintContentOnFailingStatusCode(false);

        System.out.println("This is printed on screen");
        try {
            webClient.getPage("http://www.2cash.info/index.php");
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("This is NEVER printed on screen");
    }
}

I'm also adding the result of jstack. Notice I've marked a section that gets repeated constantly:

2011-08-26 03:15:45
Full thread dump Java HotSpot(TM) Server VM (20.1-b02 mixed mode):

"Attach Listener" daemon prio=10 tid=0x09520400 nid=0x5363 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"JS executor for com.gargoylesoftware.htmlunit.WebClient@a7c45e" daemon prio=10 tid=0x6feb7400 nid=0x5356 waiting on condition [0x6fcfe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutor.run(JavaScriptExecutor.java:166)
    at java.lang.Thread.run(Thread.java:662)

"Low Memory Detector" daemon prio=10 tid=0x70204c00 nid=0x5352 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x70202800 nid=0x5351 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x70200800 nid=0x5350 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x09514c00 nid=0x534f runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x09503400 nid=0x534e in Object.wait() [0x70798000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x09501c00 nid=0x534d in Object.wait() [0x707e9000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x7675cc58> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x7675cc58> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x09482400 nid=0x5349 runnable [0xb6c34000]
   java.lang.Thread.State: RUNNABLE
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getSlot(ScriptableObject.java:2603)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.defineProperty(ScriptableObject.java:1699)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureConstantsPropertiesAndFunctions(JavaScriptEngine.java:350)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureClass(JavaScriptEngine.java:330)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.init(JavaScriptEngine.java:199)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.access$000(JavaScriptEngine.java:79)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$1.run(JavaScriptEngine.java:146)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.initialize(JavaScriptEngine.java:157)
    at com.gargoylesoftware.htmlunit.WebClient.initialize(WebClient.java:1141)
    at com.gargoylesoftware.htmlunit.WebWindowImpl.setEnclosedPage(WebWindowImpl.java:109)
    at com.gargoylesoftware.htmlunit.html.FrameWindow.setEnclosedPage(FrameWindow.java:102)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:200)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.html.BaseFrame.<init>(BaseFrame.java:73)
    at com.gargoylesoftware.htmlunit.html.HtmlInlineFrame.<init>(HtmlInlineFrame.java:46)
    at com.gargoylesoftware.htmlunit.html.DefaultElementFactory.createElementNS(DefaultElementFactory.java:288)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.startElement(HTMLParser.java:506)
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1136)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:742)
    at org.cyberneko.html.filters.DefaultFilter.startElement(DefaultFilter.java:136)
    at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:278)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2652)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2022)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:908)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:789)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:225)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)

    <THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>
    at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPageIfPossible(BaseFrame.java:149)
    at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPage(BaseFrame.java:99)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1760)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:194)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    </THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>

    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at main.Main.<init>(Main.java:42)
    at main.Main.main(Main.java:23)

"VM Thread" prio=10 tid=0x094fe000 nid=0x534c runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x09489800 nid=0x534a runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0948ac00 nid=0x534b runnable 

"VM Periodic Task Thread" prio=10 tid=0x70207000 nid=0x5353 waiting on condition 

JNI global references: 1234

I think there is some kind of loop regarding the automatic loading of frames. If that is the case, is there any way to disable that behaviour to break the loop?

Thanks in advance!

Ardussi answered 26/8, 2011 at 6:36 Comment(2)
Are you using Java 7? If so, have you tried it with Java 6? - Earthaearthborn
Yes: $ java -version java version "1.6.0_26" Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode) - Ardussi

Well, although it is a horrible solution (a workaround, actually), I finally decided to disable the automatic loading of frames in HtmlUnit, as advised by one of the HtmlUnit developers. This is what I did in detail:

  1. Downloaded the HtmlUnit source.
  2. Downloaded Maven.
  3. Commented out the body (not the declaration) of the loadFrames() method of the HtmlPage class, located in htmlunit-2.9/src/main/java/com/gargoylesoftware/htmlunit/html.
  4. Compiled this custom code, skipping the tests, with: mvn -Dmaven.test.skip=true clean compile package
  5. Took the new htmlunit-2.9.jar from htmlunit-2.9/artifacts and replaced the existing htmlunit-2.9.jar library file with it.
  6. Adapted my application. This step is probably the most delicate one, as it depends on each application, but I'll show the changes I needed to make to mine.

My original code is shown in the question. With the stock library, it would download all frames and iframes of a page. Here is an example of how to get a page with frames while loading only the frames you want:

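try {
    // With the patched jar, this returns the outer page without loading its frames.
    HtmlPage page = webClient.getPage("http://www.w3schools.com/HTML/tryit.asp?filename=tryhtml_noframes");
    // Pick the one iframe of interest by its name attribute.
    HtmlInlineFrame frame = page.getFirstByXPath("//iframe[@name='view']");
    // Resolve the iframe's src against the outer page's URL and load it as a page of its own.
    page = webClient.getPage(page.getFullyQualifiedUrl(frame.getSrcAttribute()));
    System.out.println(page.asXml());
} catch (Exception e) {
    e.printStackTrace();
}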

After this library change, the content of each frame is empty once the getPage() method finishes. Note that the frame is not null; it looks like an empty frame is returned. The content of the frames we are interested in therefore has to be downloaded manually, which is why I call getPage() a second time.
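The resolution step in the snippet above, turning a frame's src attribute into an absolute URL before the second getPage() call, can be illustrated without HtmlUnit. The following is only a minimal sketch using java.net.URI from the standard library; the class name FrameSrcResolver and the example.com URLs are hypothetical, and HtmlUnit's own getFullyQualifiedUrl() may handle extra cases (such as a base element in the page) that this sketch ignores.

```java
import java.net.URI;

// Hypothetical helper, not part of HtmlUnit: resolves a frame's src attribute
// against the URL of the page that contains the frame.
final class FrameSrcResolver {

    // Applies standard relative-reference resolution. An absolute src is
    // returned as-is; a relative one replaces the last path segment (and the
    // query string) of the page URL.
    // Throws IllegalArgumentException if either string is not a valid URI.
    static URI resolve(String pageUrl, String frameSrc) {
        return URI.create(pageUrl).resolve(frameSrc.trim());
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/HTML/tryit.asp?filename=tryhtml_noframes";
        System.out.println(resolve(page, "tryit_view.asp"));
        System.out.println(resolve(page, "/ads/banner.html"));
        System.out.println(resolve(page, "http://other.example.org/frame.html"));
    }
}
```

Resolving first also makes it easy to compare the frame's target with the URL of the outer page and skip it when both are the same, which is one way to avoid reloading a page that embeds itself.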

This is how I managed to selectively download frames and iframes with HtmlUnit. Any ideas on how to improve this will be appreciated. I hope a way to disable the loading of frames will be added to HtmlUnit itself in the future, for example a method such as getPage(URL url, boolean downloadFrames).

Hope this helps someone out there!

Ardussi answered 27/8, 2011 at 18:55 Comment(1)
Update: This workaround also seems to work in HtmlUnit 2.10, 2.11 and 2.12 - Ardussi

When I open this site in my browser, the page never finishes loading. This might also be why HtmlUnit hangs. Tested with Chrome and FF.

Try loading a simpler site instead to find out whether the problem is site-dependent.

Earthaearthborn answered 26/8, 2011 at 9:3 Comment(3)
I only tested it on FF 3.6. As you say, the site almost hangs my PC when it loads. However, take into account that JavaScript is disabled in my HtmlUnit configuration. Disable it in your browser and the site will load. Besides, the web pages I load are dynamic: I get unknown links from a known web page, so I need to be able to follow any link without a human knowing in advance which ones not to click. - Ardussi
I'm running NoScript (so no JavaScript is enabled) and the site loads forever. It does not really hang, but page loading never ends. I stopped it after 30 s of loading. - Earthaearthborn
I noticed the page finishes loading after 62 seconds on FF 3.6 and performs around 700 HTTP requests. HtmlUnit should be able to handle this, but it doesn't. I only need it to return the XML of the main page without the iframes, or even to throw an exception or time out. Anything but the current behaviour: hang the Java process, eat my CPU and melt my hardware :) I think a method like webClient.getPageWithoutFrames(URL) would be the solution. - Ardussi
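For the "throw an exception or a timeout" wish in the last comment, one generic option that does not require patching HtmlUnit is to run the blocking call on a separate thread and stop waiting after a deadline. The following is only a sketch built on java.util.concurrent; the class TimeLimitedCall and the method callWithTimeout are hypothetical names, not HtmlUnit API. It gives control back to the caller, but it cannot force a runaway thread to stop: cancellation only interrupts the worker, and a CPU-bound loop that never checks for interruption keeps running until the JVM exits.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: waits for a blocking task only up to a deadline.
final class TimeLimitedCall {

    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-call");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; it may ignore this
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast task returns its result normally.
        System.out.println(callWithTimeout(() -> "loaded", 1, TimeUnit.SECONDS));
        // A slow task is abandoned once the deadline passes.
        try {
            callWithTimeout(() -> { Thread.sleep(5_000); return "too late"; },
                    200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("gave up after timeout");
        }
    }
}
```

In the question's program, the task would be a lambda wrapping webClient.getPage(url). Whether it is safe to keep using the same WebClient after a timeout is an open question, so creating a fresh client afterwards is probably the more cautious choice.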
