Parsing XHTML with lxml in Python

I'm stuck on what should be a small problem and I don't understand what's happening. I just want to parse an ordinary XHTML page from the web, nothing special.

Here's the error:

 File "class/page.py", line 85, in xslParse
    doc = lxml.etree.fromstring(self.content)
    File "lxml.etree.pyx", line 2753, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54647)
    File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
    File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
    File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
    File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
    File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
    File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
    XMLSyntaxError: StartTag: invalid element name, line 1, column 2

self.content is the plain string from an HTTP response: no cleaning, no replacing, nothing, just what the server sent. So what is going on?

The HTML starts like this:

<!doctype html>
<!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ -->
<!--[if lt IE 7 ]> <html lang="fr" class="no-js ie6" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if IE 7 ]>    <html lang="fr" class="no-js ie7" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if IE 8 ]>    <html lang="fr" class="no-js ie8" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if IE 9 ]>    <html lang="fr" class="no-js ie9" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html lang="en" class="no-js" itemscope itemtype="http://schema.org/Product"> <!--<![endif]-->
<head>......

It's a normal web page. Why can't lxml parse a normal, well-formed document?

Barbarism asked 11/8, 2012 at 20:9

Comments (2):
Have you tried using lxml.html.fromstring instead of lxml.etree.fromstring? - Ta
Going to check this out right now, thanks! - Barbarism

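<!doctype html> indicates that it is an HTML5 document that uses HTML syntax, so you should use an HTML (not XML) parser. For comparison, an XML document might start with <?xml version="1.0" encoding="UTF-8"?>. The lowercase doctype is also the likely reason the error points at line 1, column 2: XML is case-sensitive and only accepts <!DOCTYPE in upper case, so the XML parser gives up right after the opening "<".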

You could use lxml.html.fromstring() as @unutbu suggested in the comments.
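
As a minimal sketch of the difference, the snippet below feeds the same markup to both parsers. The content string is a shortened, made-up stand-in for the page in the question (the title and body are invented), not the real response.

```python
import lxml.etree
import lxml.html

# Hypothetical, abbreviated stand-in for the page from the question.
content = """<!doctype html>
<html lang="en" class="no-js" itemscope itemtype="http://schema.org/Product">
<head><title>Example</title></head>
<body><p>Hello</p></body>
</html>"""

# The XML parser rejects the markup: it is HTML syntax, not well-formed XML.
try:
    lxml.etree.fromstring(content)
    xml_ok = True
except lxml.etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

# The HTML parser accepts the same string and returns the root <html> element.
doc = lxml.html.fromstring(content)
print(doc.tag)
print(doc.findtext(".//title"))
```

In the code from the question, that would mean replacing lxml.etree.fromstring(self.content) with lxml.html.fromstring(self.content).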

If you receive the page over HTTP, an HTML5 document that uses XML syntax should be served with an XML media type such as application/xhtml+xml or application/xml, whereas a document that uses HTML syntax is served as text/html.
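
One way to act on the media type is to pick the parser from the Content-Type header. This is only a sketch: the helper names media_type, wants_xml_parser and parse_page are made up for illustration, and it assumes the server reports the media type correctly, which is not always the case.

```python
from email.message import Message

# XML media types named in the answer; anything else is treated as HTML here.
XML_MEDIA_TYPES = {"application/xhtml+xml", "application/xml"}


def media_type(content_type_header):
    """Return the bare media type, e.g. 'text/html' from 'text/html; charset=utf-8'."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


def wants_xml_parser(content_type_header):
    """True if the header announces one of the XML media types above."""
    return media_type(content_type_header) in XML_MEDIA_TYPES


def parse_page(body, content_type_header):
    """Parse body with lxml's XML or HTML parser depending on the header."""
    # Imported here so the header helpers stay usable without lxml installed.
    import lxml.etree
    import lxml.html

    if wants_xml_parser(content_type_header):
        return lxml.etree.fromstring(body)
    return lxml.html.fromstring(body)
```

For a page served as text/html, like the one in the question, parse_page would fall through to lxml.html.fromstring.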

Malley answered 11/8, 2012 at 22:25

Comments (2):
Copy that! Things are clear now; I was thinking that an HTML doc was also an XML doc... So what's the difference between sending a MIME type like text/html and an XML one? Do browsers and lxml handle both types the same way? - Barbarism
HTML and XML are different languages; they are parsed by different parsers, which return different objects with similar but not identical interfaces. - Malley
