Resolving html entities with NSXMLParser on iPhone

I think I read every single web page relating to this problem but I still cannot find a solution to it, so here I am.

I have an HTML web page which is not under my control and I need to parse it from my iPhone application. Here is a sample of the web page I'm talking about:

<HTML>
  <HEAD>
    <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  </HEAD>
  <BODY>
    <LI class="bye bye" rel="hello 1">
      <H5 class="onlytext">
        <A name="morning_part">morning</A>
      </H5>
      <DIV class="mydiv">
        <SPAN class="myclass">something about you</SPAN> 
        <SPAN class="anotherclass">
          <A href="http://www.google.it">Bye Bye &egrave; un saluto</A>
        </SPAN>
      </DIV>
    </LI>
  </BODY>
</HTML>

I'm using NSXMLParser and it works well until it finds the &egrave; HTML entity. It calls parser:foundCharacters: for "Bye Bye " and then calls parser:resolveExternalEntityName:systemID: with an entityName of "egrave". In this method I'm just returning the character "è" converted to an NSData. parser:foundCharacters: is then called again, appending the string "è" to the previous "Bye Bye ", and after that the parser raises the NSXMLParserUndeclaredEntityError error.

I have no DTD and I cannot change the html file I'm parsing. Do you have any ideas on this problem?

Update (12/03/2010). After the suggestion of Griffo I ended up with something like this:

data = [self replaceHtmlEntities:data];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser parse];

where replaceHtmlEntities:(NSData *) is something like this:

- (NSData *)replaceHtmlEntities:(NSData *)data {
    
    NSString *htmlCode = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    NSMutableString *temp = [NSMutableString stringWithString:htmlCode];
    
    // &amp; is left untouched: it is a predefined XML entity, and a bare "&" would make the document ill-formed.
    [temp replaceOccurrencesOfString:@"&nbsp;" withString:@" " options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    ...
    [temp replaceOccurrencesOfString:@"&Agrave;" withString:@"À" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];

    NSData *finalData = [temp dataUsingEncoding:NSISOLatin1StringEncoding];
    return finalData;
    
}
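
A variation on the replacement above, as a minimal sketch rather than a tested fix: instead of substituting the literal character, each named entity can be rewritten as a numeric character reference (for example &#232; for &egrave;), which an XML parser accepts without any DTD. The function name ReplaceNamedEntities and its four-entry table are made up for illustration, the table is deliberately incomplete, and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not from the original post: rewrites a few named HTML
// entities as numeric character references, which are valid XML without a DTD.
static NSData *ReplaceNamedEntities(NSData *data) {
    NSString *html = [[NSString alloc] initWithData:data
                                           encoding:NSISOLatin1StringEncoding];
    if (html == nil) {
        return data;
    }

    // Illustrative subset only; the values are the Unicode code points in decimal.
    NSDictionary *references = [NSDictionary dictionaryWithObjectsAndKeys:
                                @"&#160;", @"&nbsp;",
                                @"&#192;", @"&Agrave;",
                                @"&#224;", @"&agrave;",
                                @"&#232;", @"&egrave;",
                                nil];

    NSMutableString *result = [NSMutableString stringWithString:html];
    for (NSString *entity in references) {
        [result replaceOccurrencesOfString:entity
                                withString:[references objectForKey:entity]
                                   options:NSLiteralSearch
                                     range:NSMakeRange(0, [result length])];
    }

    // &amp;, &lt;, &gt;, &quot; and &apos; are not in the table: they are
    // predefined in XML and must stay escaped.
    return [result dataUsingEncoding:NSISOLatin1StringEncoding];
}
```

It would be called in the same place as replaceHtmlEntities:, i.e. `data = ReplaceNamedEntities(data);` before creating the parser.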

But I am still looking for the best way to solve this problem. I will try TouchXML in the next few days, but I still think there should be a way to do this using the NSXMLParser API, so if you know how, feel free to write it here.

Haema answered 3/3, 2010 at 11:43 Comment(3)
P.S. I know that NSXMLParser is an XML parser and not an HTML parser, but I read that the same problem exists for libxml2. NSXMLParser seems easier to learn than libxml2, so I tried this one first, hoping it would work. If there is no solution to this then I'll have to switch to libxml2... - Haema
As suggested by Griffo below, I replaced every HTML entity in the text with the appropriate character and then parsed it with NSXMLParser. Now it is working, but I would really like to understand what the better way to solve this kind of problem is. - Haema
I noticed this with the &amp; entity for the ampersand character '&', at least with regard to multiple "foundCharacters" calls, which is painful to deal with. - Delmardelmer
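
For the multiple foundCharacters: calls mentioned in the last comment, the usual approach is to accumulate the fragments in a mutable string and read it when the element ends. A minimal sketch, assuming ARC; the class name TextCollector and both of its properties are hypothetical and do not come from the thread.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate that joins the fragments NSXMLParser reports for one element.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *buffer;  // text of the element being read
@property (nonatomic, copy) NSString *lastText;         // text of the last element that ended
@end

@implementation TextCollector

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict {
    // Start a fresh buffer; any text collected for an enclosing element is discarded.
    self.buffer = [NSMutableString string];
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // May be called several times for one run of text, e.g. around &amp;.
    [self.buffer appendString:string];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    // The copy attribute stores an immutable snapshot of the buffer.
    self.lastText = self.buffer;
}

@end
```

This only tracks one element at a time; mixed content such as text around a nested tag would need a stack of buffers instead.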
A
10

After exploring several alternatives, it appears that NSXMLParser will not support entities other than the standard entities &lt;, &gt;, &apos;, &quot; and &amp;.

The code below fails with an NSXMLParserUndeclaredEntityError.


// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent


NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys: 
                     [NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
                     [NSString stringWithFormat:@"%C", 0x00E0], @"agrave", 
                     ...
                     ,nil];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];

// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
    return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}

Declaring the entities by prepending ENTITY declarations to the HTML document parses without error; however, the expanded entities are not passed to parser:foundCharacters:, so the è and à characters are dropped.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
  <!ENTITY agrave "à">
  <!ENTITY egrave "è">
]>

In another experiment, I created a completely valid XML document with an internal DTD:

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
    <!ELEMENT author (#PCDATA)>
    <!ENTITY js "Jo Smith">
]>
<author>&lt; &js; &gt;</author>

I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method, and it is clear that the parser is getting the entity data; however, parser:foundCharacters: is only called for the predefined entities.

2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model: 
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before: 
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: < 
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: < 
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document

I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows for a getEntity callback to be defined. After calling getEntity, the expansion of the entity is passed to the characters callback.

NSXMLParser is missing functionality here. What should happen is that the NSXMLParser or its delegate store the entity definitions and provide them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.

In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml parser on your own is worthwhile.

This has been fun.

Akers answered 14/3, 2010 at 22:57 Comment(5)
:( This did not work. It continues to raise NSXMLParserUndeclaredEntityError = 26. :( I used your own code. It enters the resolveExternalEntityName method and then raises the error... - Haema
Can you include the URL? I have another theory that I would like to test. - Akers
Still looking for a solution. Found a possible answer at cocoabuilder.com/archive/cocoa/… but it uses NSAttributedString, which is not available on the current iPhone OS. - Akers
Ouch :(( In the meantime I tried TouchXML and read about other parsers... but it seems that this is a task you have to do on your own. :\ - Haema
Wow! Your answer is really complete! You really put everything into this, and I thank you. Great explanation. So the end of the story is that NSXMLParser sucks :) - Haema
F
2

A possibly less hacky solution is to replace the DTD with a locally modified copy in which every external entity declaration is replaced with a local one.

This is how I do it:

First, find and replace the document DTD declaration with a local file. For example, replace this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>

with this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://localhost/Users/siuying/Library/Application%20Support/iPhone%20Simulator/6.1/Applications/17065C0F-6754-4AD0-A1EA-9373F6476F8F/App.app/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>


Download the DTD from the W3C URL and add it to your app bundle. You can find the path of the file with the following code:

NSBundle* bundle = [NSBundle bundleForClass:[self class]];
NSString* path = [[bundle URLForResource:@"xhtml1-transitional" withExtension:@"dtd"] absoluteString];

Open the DTD file and find any external entity reference:

<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;      

Replace it with the content of the entity file (http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent in the above case).

After replacing all external references, NSXMLParser should handle the entities properly without needing to download every remote DTD or external entity file each time it parses an XML file.
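
The first step (pointing the DOCTYPE at the bundled file) could be sketched as below. This is an illustration only: the function name PointDoctypeAtLocalDTD is hypothetical, it handles just the XHTML 1.0 Transitional identifier shown above, and it replaces every occurrence of that URL in the document, not only the one inside the DOCTYPE.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: swaps the W3C system identifier for the URL of a local copy.
static NSString *PointDoctypeAtLocalDTD(NSString *xhtml, NSURL *localDTD) {
    if (localDTD == nil) {
        return xhtml;  // no local copy available, leave the document unchanged
    }
    NSString *remote = @"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";
    return [xhtml stringByReplacingOccurrencesOfString:remote
                                            withString:[localDTD absoluteString]];
}
```

The second argument would be the bundle URL obtained with URLForResource:withExtension: as shown earlier in this answer.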

Fattal answered 17/4, 2013 at 7:4 Comment(0)
S
0

You could do a string replace within the data before you parse it with NSXMLParser. NSXMLParser is UTF-8 only as far as I know.

Staggard answered 3/3, 2010 at 13:11 Comment(4)
Yes, I was just thinking about this, but I cannot really consider it a real solution... because there is the method resolveExternalEntityName:systemID:, for which the documentation says: "The delegate can resolve the external entity (for example, locating and reading an externally declared DTD) and provide the result to the parser object as an NSData object." So there should be a way to use it to resolve the entity and translate it for the parser... Probably I'm missing something in the logic of NSXMLParser... - Haema
But I'm reading that NSXMLDocument is not available for iPhone development. Is that true? - Haema
NSXMLDocument is available in TouchXML. See here: code.google.com/p/touchcode/wiki/TouchXML - Amanuensis
Thank you, I'll try it for sure. But I cannot stop thinking about what the correct way is to handle this case using only the SDK code... - Haema
C
0

I think you're going to run into another problem with this example, as it isn't valid XML, which is what NSXMLParser is looking for.

The exact problem in the above is that the tags META, LI, HTML and BODY aren't closed, so the parser looks all the way through the rest of the document for their closing tags.

The only way around this that I know of, if you don't have access to change the HTML, is to mirror it with the closing tags inserted.

Committee answered 3/3, 2010 at 14:31 Comment(1)
Sorry... the HTML code in the example is just the first part of the file. That's my fault. The file has every tag correctly closed. - Haema
A
0

I would try using a different parser, like libxml2 - in theory I think that one should be able to handle poor HTML.
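
As a rough sketch of what that could look like with libxml2's HTML parser (htmlReadMemory), which knows the HTML entities and tolerates unclosed tags: the helper name TextContentOfHTML is hypothetical, and the code assumes the project links against libxml2 (-lxml2) with the libxml2 headers on the header search path.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical helper: returns all text of an HTML document with entities decoded.
static NSString *TextContentOfHTML(NSData *htmlData) {
    htmlDocPtr doc = htmlReadMemory((const char *)[htmlData bytes],
                                    (int)[htmlData length],
                                    NULL,  // no base URL
                                    NULL,  // no encoding override
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }

    NSString *text = nil;
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL) {
        // xmlNodeGetContent returns a UTF-8 buffer that the caller must free.
        xmlChar *content = xmlNodeGetContent(root);
        if (content != NULL) {
            text = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return text;
}
```

This builds a whole tree in memory rather than streaming events, so it is a different trade-off from NSXMLParser; walking the xmlNodePtr children would be needed to pick out specific elements.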

Acnode answered 3/3, 2010 at 21:7 Comment(1)
I read that libxml2 has an HTML parser, but I could not find a tutorial, documentation or example for it, and this is why I tried NSXMLParser first. - Haema
G
0

Since I've just started doing iOS development I've been searching for the same thing and found a related mailing list entry: http://www.mail-archive.com/[email protected]/msg17706.html

- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName: (NSString *)entityName systemID:(NSString *)systemID {       
    NSAttributedString *entityString = [[[NSAttributedString alloc] initWithHTML:[[NSString stringWithFormat:@"&%@;", entityName] dataUsingEncoding:NSUTF8StringEncoding] documentAttributes:NULL] autorelease];

    NSLog(@"resolved entity name: %@", [entityString string]);

    return [[entityString string] dataUsingEncoding:NSUTF8StringEncoding];
}

This is fairly similar to your original solution and also causes a parser error NSXMLParserErrorDomain error 26; but it does continue parsing after that. The problem is, of course, that it's harder to tell real errors apart ;-)

Gayden answered 24/5, 2012 at 4:39 Comment(0)
