Can CDATA sections be preserved by BeautifulSoup?

Asked 7/5, 2013 at 18:56 Answered 26/12, 2015 at 21:0

python xml beautifulsoup lxml cdata


I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.

The culprit XML file:

<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        !@#$%^&*()_+{}|:"<>?,./;'[]\-=
    ]]></bar>
</foo>

And here's the Python script.

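from bs4 import BeautifulSoup

# Parse the file with BeautifulSoup's lxml-based XML parser;
# the context manager closes the file once parsing is done.
with open("cdata.xml", "r") as xmlfile:
    soup = BeautifulSoup(xmlfile, "xml")

print(soup)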

Here's the output. Note the CDATA section tags are missing.

<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
        !@#$%^&amp;*()_+{}|:"&lt;&gt;?,./;'[]\-=
    </bar>
</foo>

I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?

Is there a way to tell BeautifulSoup to preserve CDATA sections?

Update: Yes, it's an lxml thing (see http://lxml.de/api.html#cdata). So the question becomes: is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?
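For reference, here is a minimal sketch of what strip_cdata=False does when lxml is used directly. It bypasses BeautifulSoup entirely, so it does not show whether BeautifulSoup can pass the option through. The inline XML is a simplified stand-in for cdata.xml, and the variable names are illustrative only.

```python
from lxml import etree

# Simplified stand-in for cdata.xml (an assumption, not the original file).
xml_bytes = b"<foo><bar><![CDATA[a < b && c > d]]></bar></foo>"

# strip_cdata=False asks lxml to keep CDATA sections as CDATA
# instead of converting them to ordinary escaped text.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml_bytes, parser)

# The text is still readable as a normal string.
bar = root.find("bar")
original_text = bar.text

# Serializing writes the CDATA section back out.
serialized = etree.tostring(root, encoding="unicode")
print(serialized)

# When modifying the text, wrap the new value in etree.CDATA
# so the section is written as CDATA again.
bar.text = etree.CDATA(original_text + " (edited)")
edited = etree.tostring(root, encoding="unicode")
print(edited)
```

Assigning a plain string to bar.text instead of etree.CDATA(...) would write the text in escaped form, without the CDATA wrapper.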

Harriette asked 7/5, 2013 at 18:56 Comment(4)

This thread suggests there is a bug in lxml affecting this: groups.google.com/forum/?fromgroups=#!topic/beautifulsoup/… – Topple 7/5, 2013 at 19:21

@Topple The last post does suggest that what I want to do isn't possible, though. Feel free to post that link as an answer so I can choose it. – Harriette 7/5, 2013 at 21:16

Possible duplicate of "How can I grab CData out of BeautifulSoup" – Jasper 7/5, 2013 at 22:29

This isn't a duplicate. That question is about how to find/extract CDATA sections. This one is about how to preserve them when XML is output. The first is possible, the latter is not. – Harriette 9/5, 2013 at 23:34


In my case, if I use

soup = BeautifulSoup(xmlfile, "lxml-xml")

then CDATA is preserved and accessible.

Ama answered 26/12, 2015 at 21:0 Comment(1)

If you use lxml-xml, the <![CDATA[...]]> surrounding text is removed and you're left with just the contents. If you want to keep the <![CDATA[]]> literal text, you can use html.parser. – Cardoso 15/1, 2021 at 16:15
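A rough sketch of the html.parser route mentioned in the comment above, assuming a simplified inline document rather than the original cdata.xml. With this parser, BeautifulSoup represents the section as a bs4.element.CData node, which is written back with its <![CDATA[...]]> markers. Note that html.parser is an HTML parser, not an XML parser: it lowercases tag names, so it may not suit every XML file.

```python
from bs4 import BeautifulSoup
from bs4.element import CData

# Simplified stand-in for cdata.xml (an assumption, not the original file).
markup = "<foo><bar><![CDATA[a < b && c > d]]></bar></foo>"

soup = BeautifulSoup(markup, "html.parser")

# The only child of <bar> is a CData node, a NavigableString subclass.
node = soup.find("bar").string
print(type(node).__name__, repr(str(node)))

# Output keeps the CDATA markers around the text.
original_output = str(soup)
print(original_output)

# To modify the text and keep it as CDATA, replace the node with a new CData.
node.replace_with(CData("a <= b"))
edited_output = str(soup)
print(edited_output)
```

Replacing the node with a plain string instead of CData(...) would produce ordinary escaped text in the output.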
