htmlunit Cannot read property "push" from undefined

I'm trying to crawl a website using htmlunit. Whenever I run it though it only outputs the following error:

Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "push" from undefined (https://www.kinoheld.de/dist/prod/0.4.7/widget.js#1)

Now I don't know much about JS, but I read that push is some kind of array operation. This seems standard to me and I don't know why it would not be supported by htmlunit.
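For context on the message itself: it does not say that push is unsupported. It says that the value the script tried to call push on was undefined, so the call never reached the array method at all. A loose Java analogy, purely as an illustration, is calling a method on a null reference. The class and its queue field below are hypothetical and have nothing to do with the site's widget.js.

```java
import java.util.ArrayList;
import java.util.List;

class UndefinedReceiverAnalogy {
    // Hypothetical stand-in for an object that some earlier script was
    // expected to create. It is never initialised here, so it is null.
    static List<String> queue;

    // Mirrors the failing call: the method exists, but the receiver does not.
    static String tryAdd() {
        try {
            queue.add("event");
            return "ok";
        } catch (NullPointerException e) {
            return "failed: receiver was null";
        }
    }

    // Creating the receiver first makes the same call succeed.
    static String addWithGuard() {
        if (queue == null) {
            queue = new ArrayList<>();
        }
        queue.add("event");
        return "ok, size=" + queue.size();
    }

    public static void main(String[] args) {
        System.out.println(tryAdd());
        System.out.println(addWithGuard());
    }
}
```

In other words, the likely question is why the object is missing when the script runs under HtmlUnit, not whether push is available.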

Here is the code I'm using so far:

public static void main(String[] args) throws IOException {
    WebClient web = new WebClient(BrowserVersion.FIREFOX_45);
    web.getOptions().setUseInsecureSSL(true);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";
    web.getOptions().setThrowExceptionOnFailingStatusCode(false);
    web.waitForBackgroundJavaScript(9000);
    HtmlPage response = web.getPage(url);

    System.out.println(response.getTitleText());
}

What am I missing? Is there a way around this or a way to fix this? Thanks in advance!

Muskogean answered 17/11, 2016 at 14:18 Comment(4)
If it's not supported, I guess you should ask the developers for a new feature. – Tighe
When does the error occur? After the web.getPage(url) or the response.getTitleText() call? – Crossindex
@Crossindex The error occurs after web.getPage(url): I can comment out response.getTitleText() and it is still thrown, even when web.getOptions().setThrowExceptionOnScriptError(false); (see answer below) is inserted. – Muskogean
@TilakMadichetti Is there a proper place to do this? – Muskogean

I've encountered a similar problem before. It stems from HtmlUnit being designed as a test-harness framework rather than a web-scraping one. Are you running the latest version of HtmlUnit?

I was able to run your code by adding two lines: setThrowExceptionOnScriptError(false) (as mentioned in Coffee Converter's answer), and java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF); at the top of the method to disable the log dump. This yielded the following output:

Royal Filmpalast München München | kinoheld.de

Full code is as follows:

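import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void main(String[] args) throws IOException {

    // Silence HtmlUnit's log output (the "log dump").
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";

    webClient.getOptions().setUseInsecureSSL(true);
    // Do not abort the page load when a script on the page throws.
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    HtmlPage response = webClient.getPage(url);
    // Wait up to 9 seconds for background scripts; this only has an effect once a page is loaded.
    webClient.waitForBackgroundJavaScript(9000);

    System.out.println(response.getTitleText());
}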

This was run from the command line on Red Hat with HtmlUnit 2.2.1. Hope this helps.
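As a side note on why the logger line quiets everything: java.util.logging loggers form a hierarchy by name, and a logger with no level of its own inherits the level of its nearest named ancestor. Turning off "com.gargoylesoftware" therefore also covers loggers such as com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl. Below is a minimal sketch using only the JDK. The class and method names are made up for illustration, and it assumes HtmlUnit's log output actually ends up in java.util.logging, which depends on the logging backend on the classpath.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietHtmlUnitLogs {
    // Keep a strong reference: the LogManager holds loggers only weakly,
    // so an unreferenced logger (and the level set on it) can be collected.
    private static final Logger HTMLUNIT_PARENT =
            Logger.getLogger("com.gargoylesoftware");

    // Switch off the parent logger; children without their own level inherit this.
    static void silence() {
        HTMLUNIT_PARENT.setLevel(Level.OFF);
    }

    // Reports whether a message at the given level would be logged by the named logger.
    static boolean wouldLog(String loggerName, Level level) {
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        String child = "com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl";
        // Before silencing, the result depends on the active logging configuration.
        System.out.println(wouldLog(child, Level.WARNING));
        silence();
        // After silencing, the child inherits OFF from its parent.
        System.out.println(wouldLog(child, Level.WARNING));
    }
}
```

Holding the logger in a static field is a deliberate choice here: a one-off Logger.getLogger(...).setLevel(...) call can lose its effect if the logger is garbage-collected before the library creates its own child loggers.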

Crossindex answered 23/11, 2016 at 15:9 Comment(0)

Try adding

web.getOptions().setThrowExceptionOnScriptError(false);

before you try to get the page. This tells HtmlUnit not to throw an exception when a script on the page fails. However, it might not work 100% of the time if, for instance, the JavaScript that throws the error is needed to produce the data you are scraping (which it hopefully isn't). If that doesn't work, try using Selenium with ChromeDriver or GhostDriver.

Source

Reprisal answered 22/11, 2016 at 21:27 Comment(4)
Adding that line doesn't work; it still throws the same error and doesn't get me anywhere... I'll try whatever Selenium is later when I've got more time ;) – Muskogean
With the line you suggested, though, the output now shows com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify before the original exception, and then prints the rest of the stack trace. – Muskogean
I really wish I could split the 50 points up. While @Crossindex's answer did actually solve the question, your suggestion might be more helpful for me in the long run... – Muskogean
@Muskogean No worries, happy to help. – Reprisal
