Encoding error while parsing RSS with lxml
I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
Delladelle answered 27/4, 2011 at 23:44 Comment(0)

You should probably specify the character encoding only as a last resort, since the encoding is clear from the XML prolog (if not from the HTTP headers). In any case, it's unnecessary to pass the encoding to etree.XMLParser unless you want to override it, so drop the encoding parameter and it should work.

Edit: okay, the problem actually seems to be with lxml. The following works, for whatever reason:

parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
Lexy answered 28/4, 2011 at 0:12 Comment(2)
I still get the same error when I run the script without the encoding parameter. Why does etree.XMLParser fail even though I pass the right encoding? - Delladelle
It is working now, but I had to upgrade lxml to version 2.2.8, because with 2.2.4 I wasn't able to parse a remote URL. Moreover, the code from my question works when I change the last line to: tree = etree.parse(StringIO.StringIO(response), parser) - Delladelle

I ran into a similar problem, and it turns out this has nothing to do with encodings. lxml is actually raising a totally unrelated error: the .parse function expects a filename, URL, or file-like object, not a string holding the contents itself. However, when it tries to print out that error, it chokes on the non-ASCII characters and shows this completely confusing UnicodeDecodeError instead. It is highly unfortunate, and other people have commented on this issue here:

https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html

Luckily, yours is a very easy fix. Just replace .parse with .fromstring and you should be totally good to go:

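import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()  # raw bytes of the feed
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)

# fromstring() parses the document contents and returns the root element;
# parse() would instead treat the string as a filename or URL
tree = etree.fromstring(response, parser)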

Just tested this on my machine and it worked fine. Hope it helps!
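
To make the distinction concrete, here is a minimal sketch of the two ways to parse XML that is already in memory. It is written for Python 3 rather than the Python 2 of the question, and it avoids the network: FEED_BYTES is a made-up stand-in for what response.read() returns, and parse_from_memory is a hypothetical helper name. The import falls back to the standard library's ElementTree, which offers the same fromstring() and parse() calls, in case lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Fallback for illustration only: the stdlib module exposes the same
    # fromstring()/parse() calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes returned by response.read().
FEED_BYTES = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<rss version="2.0"><channel>'
    u'<title>Wiadomo\u015bci</title>'
    u'</channel></rss>'
).encode("utf-8")


def parse_from_memory(data):
    """Parse in-memory XML bytes two ways and return both root elements."""
    # fromstring() takes the document contents and returns the root element.
    root = etree.fromstring(data)
    # parse() takes a filename, URL, or file-like object, so wrap the bytes.
    tree = etree.parse(io.BytesIO(data))
    return root, tree.getroot()


root_a, root_b = parse_from_memory(FEED_BYTES)
print(root_a.findtext("channel/title") == root_b.findtext("channel/title"))
```

Wrapping the bytes in io.BytesIO is the Python 3 counterpart of the StringIO.StringIO(response) workaround mentioned in the comments on another answer.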

Beechnut answered 18/1, 2012 at 21:49 Comment(1)
MAY YOUR DAYS BE BLESSED WITH THE ETERNAL BEAUTY AND HARMONY SIR! - Heedless

It's often easier to load the string and sort out its encoding yourself first, and then call fromstring on it, rather than rely on the lxml.etree.parse() function and its difficult-to-manage encoding options.

This particular rss file begins with the encoding declaration, so everything should just work:

<?xml version="1.0" encoding="utf-8"?>

The following code shows some of the variations you can apply to make etree parse input in different encodings. You can also ask it to serialize the tree in a different encoding, which then appears in the XML declaration of the output.

import lxml.etree
import urllib2

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
        # ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']

uresponse = response.decode("utf8")
print [uresponse]    
        # [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']

tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
        # ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomo&#347;ci...']

lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
        # ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomo&#347;ci..."]


# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])   

# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)

Code can be tried here: http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
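
The hard-coded [38:] slice only works for this exact declaration. As a rough sketch of a less brittle variant, the declaration can be removed with a regular expression before handing decoded text to fromstring. This is Python 3, the sample document is invented for illustration, and strip_xml_declaration and XML_DECLARATION are hypothetical names, not part of lxml. The import falls back to the standard library's ElementTree if lxml is not installed. Passing the raw bytes remains the simpler route, since the parser then reads the declared encoding itself.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Fallback for illustration only; fromstring() exists in both modules.
    import xml.etree.ElementTree as etree

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?> plus surrounding whitespace.
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>\s*')

ATOM_TITLE = "{http://www.w3.org/2005/Atom}title"


def strip_xml_declaration(text):
    """Return text without its leading XML declaration, if it has one."""
    return XML_DECLARATION.sub("", text, count=1)


# Invented sample standing in for the downloaded feed.
raw = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<feed xmlns="http://www.w3.org/2005/Atom">'
    u'<title>Wiadomo\u015bci</title>'
    u'</feed>'
).encode("utf-8")

# Option 1: pass bytes and let the parser honour the declared encoding.
root_from_bytes = etree.fromstring(raw)

# Option 2: decode yourself, then drop the declaration before parsing.
text = raw.decode("utf-8")
root_from_text = etree.fromstring(strip_xml_declaration(text))

print(root_from_bytes.findtext(ATOM_TITLE) == root_from_text.findtext(ATOM_TITLE))
```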

Ideate answered 4/5, 2011 at 10:36 Comment(0)