Android Web Scraping with a Headless Browser [closed]
I have spent a day researching libraries that can accomplish the following:

  • Retrieve the full contents of a web page in the background, without rendering the result to a view.
  • Support pages that fire off AJAX requests to load additional data after the initial HTML has loaded.
  • Let me grab elements from the resulting HTML using XPath or CSS selectors.
  • In the future, possibly navigate to a next page (firing events, clicking buttons/links, etc.).

Here is what I have tried without success:

  • Jsoup: Works great, but has no support for JavaScript/AJAX (so it does not load the full page).
  • Android's built-in HttpEntity: Same problem with JavaScript/AJAX as Jsoup.
  • HtmlUnit: Looks like exactly what I need, but after hours I cannot get it to work on Android. (Other users failed when trying to load the 12MB+ of jar files. I loaded the full source code and referenced it as a project library, only to find that things such as Applets and java.awt, which HtmlUnit uses, do not exist on Android.)
  • Rhino: I find this very confusing. I don't know how to get it working on Android, or even whether it is what I am looking for.
  • Selenium Driver: Looks like it could work, but there is no straightforward way to run it headlessly, so that the actual HTML is not displayed in a view.

I really want HtmlUnit to work, as it seems best suited to my needs. Is there any way to make it work, or is there another library I have missed that would be suitable?

I am currently using Android Studio 0.1.7 and can move to Eclipse if needed.

Thanks in advance!

Hoenir answered 1/7, 2013 at 7:6 Comment(8)
Seems that there is nothing that can be used for my scenario. I have started working on an Android port of HtmlUnit and hope to have something working soon. I will post here as soon as I have checked in an HtmlUnit branch that anyone can download. Hopefully I can get the HtmlUnit developers involved, as there seems to be a lot of interest in an Android port. - Hoenir
It's been 4 YEARS AND WE'RE STILL HERE! I'M FACING THE SAME PROBLEM! - Oblige
Given the current answers, this should be reworded to not be a library request. It could then be reopened. If you do reword it, please ping me @Makyen, so I can help in getting it reopened. - Uranography
Any recommended libraries for 2020? - Manufacture
@Manufacture There are quite a few promising posts about Selenium being used to web crawl with JS capabilities (in Python), but I have yet to get it to work in Android Studio. I'm pretty sure Selenium piggybacks off the local device's webdrivers, which makes it difficult to use the popular chromedriver built for Windows. I'm going to give this answer a shot, but it's amazing that there still isn't a good solution 7 YEARS after this was posted. - Allison
Well, believe it or not, as this post suggests, WebView does seem to do the job more or less. The only caveat I have not solved yet is navigating to pages based on the HTML result of the page #62727065 - Manufacture
The link to the HtmlUnit Android port: github.com/HtmlUnit/htmlunit-android - Halleyhalli
!!!!!!!!!!!! HTMLUNIT IS NOW ON ANDROID: github.com/HtmlUnit/htmlunit-android !!!!!!!!!!!! - Alcahest
H
37

OK, after 2 weeks I admit defeat and am using a workaround, which works great for me at the moment.

The problem:
It is too difficult to port HtmlUnit to Android (at least at my level of expertise). I am sure it is a worthwhile project (and not that time-consuming for an experienced Java programmer). I emailed the HtmlUnit developers, and they replied that they have not looked into a port or the effort it would involve, but suggested that anyone who wants to start such a project should send a message to their mailing list to get more developers involved (http://htmlunit.sourceforge.net/mail-lists.html).

The workaround:
I used Android's built-in WebView and overrode the onPageFinished method of WebViewClient to inject JavaScript that grabs all the HTML after the page has finished loading. WebView can also be used to run further JavaScript actions: clicking buttons, filling in forms, etc.

Code:

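// Uses android.webkit.WebView, android.webkit.WebViewClient and android.webkit.JavascriptInterface
webView.getSettings().setJavaScriptEnabled(true);
MyJavaScriptInterface jInterface = new MyJavaScriptInterface();
webView.addJavascriptInterface(jInterface, "HtmlViewer");

webView.setWebViewClient(new WebViewClient() {

    @Override
    public void onPageFinished(WebView view, String url) {
        // Page has finished loading: hand the rendered HTML to the Java interface
        view.loadUrl("javascript:window.HtmlViewer.showHTML('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
    }

});

// loadUrl() returns immediately; the HTML only arrives later, in showHTML()
webView.loadUrl(StartURL);

public class MyJavaScriptInterface {

    public volatile String html;

    // Invoked from JavaScript on a background thread, not on the UI thread
    @JavascriptInterface
    public void showHTML(String _html) {
        html = _html;
        // Parse here rather than right after loadUrl(), where html would still be null
        ParseHtml(html);
    }
}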
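
The follow-up actions mentioned above (clicking buttons, filling in forms) and pages that load data via AJAX after onPageFinished can be handled with the same injection trick. The following is only a rough sketch in plain Java: the JsSnippets class, its method names, and the polling approach are assumptions made for illustration and are not part of Android or of the code above. It only builds the "javascript:" strings to pass to loadUrl, and it assumes the HtmlViewer interface has been registered as shown earlier.

```java
/** Hypothetical helper (not an Android API): builds "javascript:" URLs for WebView.loadUrl(). */
final class JsSnippets {

    private JsSnippets() {}

    /** Escapes text so it can sit inside a single-quoted JavaScript string literal. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    /** Clicks the first element matching the CSS selector, if there is one. */
    static String click(String cssSelector) {
        return "javascript:(function(){"
                + "var e=document.querySelector(" + quote(cssSelector) + ");"
                + "if(e){e.click();}"
                + "})()";
    }

    /** Sets the value of the first form field matching the CSS selector, if there is one. */
    static String setValue(String cssSelector, String value) {
        return "javascript:(function(){"
                + "var e=document.querySelector(" + quote(cssSelector) + ");"
                + "if(e){e.value=" + quote(value) + ";}"
                + "})()";
    }

    /**
     * Polls every intervalMs milliseconds until an element matching the selector
     * exists, then passes the page HTML to window.HtmlViewer.showHTML.
     * Assumes the "HtmlViewer" interface from the code above is registered.
     */
    static String dumpHtmlWhenPresent(String cssSelector, int intervalMs) {
        return "javascript:(function(){"
                + "var t=setInterval(function(){"
                + "if(document.querySelector(" + quote(cssSelector) + ")){"
                + "clearInterval(t);"
                + "window.HtmlViewer.showHTML(document.documentElement.outerHTML);"
                + "}"
                + "}," + intervalMs + ");"
                + "})()";
    }
}
```

For example, inside onPageFinished one could call view.loadUrl(JsSnippets.dumpHtmlWhenPresent("div.results", 250)) to wait for AJAX-loaded content, and later webView.loadUrl(JsSnippets.click("a.next")) to move to the next page. Both selectors are made up for this example. Each snippet is wrapped in a function that returns nothing, so the result of the script does not replace the current page.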
Hoenir answered 17/7, 2013 at 15:18 Comment(5)
I am also trying to create an Android app, but I need to scrape the website first in order to proceed, and that site also uses JavaScript (it is dynamically loaded). Any suggestions? Thanks! - Luigi
This problem is still not solved. An HtmlUnit port for Android would be a dream, as you can pick up elements from the page and call a .click() method to generate a new page. Is there any way you can do that using the Android WebView? - Individuation
Can this work while the phone is in standby? - Monometallism
@Monometallism Did you find the answer? - Padova
What about Retrofit? Has anyone tried? github.com/square/retrofit - Pretonic
H
0

I took the implementation mentioned above (injecting JavaScript), and it works for me. All I do is keep the WebView hidden underneath other UI elements. I was also thinking of doing the same with Selenium. I have used Selenium with Chrome in Python and it's great, but as you mentioned, it is not easy to avoid showing the browser window. I think it might be possible to just not show the component in Android, though. I'll have to try.

Harriot answered 26/5, 2019 at 6:1 Comment(0)
