What's the best approach for parsing XML/'screen scraping' in iOS? UIWebview or NSXMLParser?

I am creating an iOS app that needs to get some data from a web page. My first thought was to use NSXMLParser's initWithContentsOfURL: and parse the HTML with the NSXMLParser delegate. However, this approach seems like it could quickly become painful (if, for example, the HTML changed, I would have to rewrite the parsing code, which could be awkward).

Seeing as I'm loading a web page, I took a look at UIWebView too. It looks like UIWebView may be the way to go. stringByEvaluatingJavaScriptFromString: seems like a very handy way to extract the data, and it would allow the JavaScript to be stored in a separate file that would be easy to edit if the HTML changed. However, using UIWebView seems a bit hacky (seeing as UIWebView is a UIView subclass, it may block the main thread, and the docs say that the JavaScript has a limit of 10 MB).

Does anyone have any advice regarding parsing XML/HTML before I get stuck in?

UPDATE:

I wrote a blog post about my solution: HTML parsing/screen scraping in iOS

Cymophane answered 22/8, 2010 at 13:22 Comment(0)

Parsing HTML with an XML parser usually does not work anyway because many sites have incorrect HTML, which a web browser will deal with, but a strict XML parser like NSXMLParser will totally fail on.
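To make the failure mode concrete: HTML lets you write void elements such as `<br>` without a closing tag and leave off the end tags of elements such as `<li>`, while XML requires every element to be closed in order. The sketch below is a deliberately simplified tag-balance check in plain C, not NSXMLParser itself; the function name `tags_balanced` and the size limits are hypothetical, and comments, doctypes, CDATA and `>` inside attribute values are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_DEPTH 64   /* hypothetical limit on nesting depth */
#define MAX_NAME  32   /* hypothetical limit on tag-name length */

/* Returns 1 if every start tag is closed by a matching end tag in the
 * right order (the part of XML well-formedness that HTML most often
 * violates), 0 otherwise. Tag names are compared case-sensitively,
 * as in XML. */
int tags_balanced(const char *markup)
{
    assert(markup != NULL);
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;

    for (const char *p = markup; *p != '\0'; p++) {
        if (*p != '<') continue;
        p++;                                   /* step past '<' */
        int closing = (*p == '/');
        if (closing) p++;

        /* Read the tag name. */
        char name[MAX_NAME];
        size_t n = 0;
        while (isalnum((unsigned char)*p) && n < MAX_NAME - 1)
            name[n++] = *p++;
        name[n] = '\0';
        if (n == 0) return 0;                  /* stray '<' with no name */

        /* Find the end of the tag and detect the <name/> form. */
        const char *gt = strchr(p, '>');
        if (gt == NULL) return 0;
        int self_closing = (gt > p && gt[-1] == '/');

        if (closing) {
            /* An end tag must match the most recently opened element. */
            if (depth == 0 || strcmp(stack[depth - 1], name) != 0) return 0;
            depth--;
        } else if (!self_closing) {
            if (depth == MAX_DEPTH) return 0;
            strcpy(stack[depth++], name);
        }
        p = gt;                                /* loop increment skips '>' */
    }
    return depth == 0;                         /* nothing left open */
}
```

With this check, `<p>Hello<br/>world</p>` passes, while the equally browser-friendly `<p>Hello<br>world</p>` and `<ul><li>one<li>two</ul>` are rejected, which is the kind of input a strict parser gives up on.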

For many scripting languages there are great scraping libraries that are more forgiving, such as Python's Beautiful Soup module. Unfortunately, I do not know of such modules for Objective-C.

Loading stuff into a UIWebView might be the simplest way to go here. Note that you do not have to put the UIWebView on screen. You can create a separate UIWindow and add the UIWebView to it, so that you do full off-screen rendering. There was a WWDC2009 video about this I think. As you already mention, it will not be lightweight though.

Depending on the data that you want and the complexity of the pages that you need to parse, you might also be able to parse it using regular expressions or even a hand-written parser. I have done this many times, and for simple data this works well.
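For simple data, the hand-written route can be as small as a search for two known markers. Here is a minimal sketch in plain C (which compiles as-is inside an Objective-C project); the function name `extract_between` and the marker strings are hypothetical, and it assumes the markers appear literally in the page source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the text between the first occurrence of `start` and the next
 * occurrence of `end` into `out`, NUL-terminated.
 * Returns 1 on success, 0 if a marker is missing or `out` is too small. */
int extract_between(const char *html, const char *start, const char *end,
                    char *out, size_t out_size)
{
    assert(html != NULL && start != NULL && end != NULL && out != NULL);
    if (out_size == 0) return 0;
    out[0] = '\0';

    /* Locate the opening marker and step past it. */
    const char *begin = strstr(html, start);
    if (begin == NULL) return 0;
    begin += strlen(start);

    /* Locate the closing marker after the opening one. */
    const char *stop = strstr(begin, end);
    if (stop == NULL) return 0;

    size_t len = (size_t)(stop - begin);
    if (len >= out_size) return 0;             /* would not fit */

    memcpy(out, begin, len);
    out[len] = '\0';
    return 1;
}
```

Called with the markers `<span>` and `</span>` on `<td class='col-name'><a><span>Alice</span></a></td>`, it copies `Alice` into the buffer. It breaks as soon as the page changes its attributes or nesting, so it only suits small, stable pages.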

Tridentum answered 22/8, 2010 at 15:12 Comment(3)
Good answer! I think it's important to note that even correct HTML will be rejected by a strict XML parser. Only (correctly written) XHTML really stands a good chance of getting through an XML parser, which makes your recommendation of UIWebView the most likely best route to go. - Cilka
Well, don't forget that the UIWebView will also load everything else on the page: images, JavaScript, etc. This could lead to a LOT of memory usage. Personally, I would try a regular expression or hand-written parser first. If that is too difficult, then I would go the UIWebView route. - Tridentum
Excellent point - I hadn't considered the well-formedness of the markup. That clinches it for me. - Cymophane

I've done this a few times. The best approach I've found is to use libxml2 which has a mode for HTML. Then you can use XPath to query the document.

Working with the libxml2 API is not the most enjoyable. So, I usually bring over the XPathQuery.h/.m files documented on this page:

http://cocoawithlove.com/2008/10/using-libxml2-for-parsing-and-xpath.html

Then I fetch the data using an NSURLConnection and query the data with something like this:

NSArray *tdNodes = PerformHTMLXPathQuery(self.receivedData, @"//td[@class='col-name']/a/span");

Summary:

  1. Add libxml2 to your project. Here are some quick instructions for Xcode 4: http://cmar.me/2011/04/20/adding-libxml2-to-an-xcode-4-project/

  2. Get the XPathQuery.h/.m

  3. Use an XPath statement to query the HTML document.

Kosaka answered 21/4, 2011 at 21:7 Comment(0)