Parsing RSS with ElementTree in Python

How do you search for namespace-specific tags in XML using ElementTree in Python?

I have an XML/RSS document like:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.0/"
>
<channel>
    <title>sometitle</title>
    <pubDate>Tue, 28 Aug 2012 22:36:02 +0000</pubDate>
    <generator>http://wordpress.org/?v=2.5.1</generator>
    <language>en</language>
    <wp:wxr_version>1.0</wp:wxr_version>
    <wp:category><wp:category_nicename>apache</wp:category_nicename><wp:category_parent></wp:category_parent><wp:cat_name><![CDATA[Apache]]></wp:cat_name></wp:category>
</channel>
</rss>

But when I try to find all "wp:category" tags with:

import xml.etree.ElementTree as xml
tree = xml.parse(fn)
doc = tree.getroot()
categories = doc.findall('channel/wp:category')

I get the error:

SyntaxError: prefix 'wp' not found in prefix map

Searching for tags that have no namespace prefix works just fine. What am I doing wrong?

Dilorenzo answered 12/10, 2012 at 14:56 Comment(0)

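You need to handle the namespace prefixes yourself. ElementTree does not keep the document's prefix-to-URI mapping once parsing is done, so a path like 'channel/wp:category' has nothing to resolve 'wp' against. Either use iterparse and handle the namespace events directly, or explicitly declare the prefixes you're interested in and pass that mapping to the search. There is also a cruder option, depending on what you're trying to do: I will admit that in my lazier moments I just strip all the prefixes out with a string replace before parsing the XML.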
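A minimal sketch of the "declare the prefixes explicitly" route, assuming Python's standard xml.etree.ElementTree. The sample document is trimmed from the question, and the NAMESPACES name is just an illustrative choice. The mapping is passed as the second argument to findall; the fully qualified {uri}tag form is shown as an equivalent alternative.

```python
import xml.etree.ElementTree as ET

# Sample document, trimmed from the one in the question.
RSS = """<rss version="2.0"
    xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
    <title>sometitle</title>
    <wp:category><wp:category_nicename>apache</wp:category_nicename><wp:cat_name><![CDATA[Apache]]></wp:cat_name></wp:category>
</channel>
</rss>"""

# Prefix-to-URI mapping supplied by hand (the name NAMESPACES is arbitrary).
# The prefix used here only has to match the one used in the search path;
# the URI is what has to match the document.
NAMESPACES = {"wp": "http://wordpress.org/export/1.0/"}

root = ET.fromstring(RSS)

# Pass the mapping so the 'wp' prefix in the path can be resolved.
categories = root.findall("channel/wp:category", NAMESPACES)
names = [c.findtext("wp:cat_name", namespaces=NAMESPACES) for c in categories]

# Equivalent search with the namespace URI written out in braces.
same = root.findall("channel/{http://wordpress.org/export/1.0/}category")
```

With this sample, names holds the single category name from the document, and both searches return the same element.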

EDIT: this similar question might help.
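For the iterparse route, here is a rough sketch that collects the prefixes the document itself declares, so they do not have to be typed in by hand. It assumes the standard library's iterparse and its "start-ns" event; the helper name parse_with_namespaces is hypothetical. If a prefix is declared more than once in a document, the later declaration wins in this simple version.

```python
import io
import xml.etree.ElementTree as ET

# Sample document, trimmed from the one in the question.
RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
    <title>sometitle</title>
    <wp:category><wp:category_nicename>apache</wp:category_nicename></wp:category>
</channel>
</rss>"""


def parse_with_namespaces(source):
    """Parse source and return (root, {prefix: uri}).

    Hypothetical helper: it records every namespace declaration seen
    while parsing and remembers the first element that starts (the root).
    """
    namespaces = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload  # payload is a (prefix, uri) tuple
            namespaces[prefix] = uri
        elif root is None:
            root = payload  # first "start" event is the root element
    return root, namespaces


# iterparse accepts a filename or a file object; BytesIO stands in for a file.
root, ns = parse_with_namespaces(io.BytesIO(RSS))
categories = root.findall("channel/wp:category", ns)
```

The root element is only partially built when its "start" event fires, but by the time the loop finishes the same object holds the whole tree, so it can be searched as usual.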

Monkfish answered 12/10, 2012 at 15:1 Comment(2)
This makes no sense. The namespace prefixes are defined in the parent <rss> tag. I shouldn't have to pre-parse my RSS document just so I can spoon-feed the namespaces to my RSS parser... (Dilorenzo)
I'm not arguing with you, I'm just saying that's how I got around it. (Monkfish)
