Writing an XML header with LXML
Asked Answered
R

1

13

I'm currently writing a script to convert a bunch of XML files from various encodings to a unified UTF-8.

I first try determining the encoding using LXML:

def get_source_encoding(self):
    tree = etree.parse(self.inputfile)
    encoding = tree.docinfo.encoding
    self.inputfile.seek(0)
    return (encoding or '').lower()

If that's blank, I try getting it from chardet:

def guess_source_encoding(self):
    chunk = self.inputfile.read(1024 * 10)
    self.inputfile.seek(0)
    return chardet.detect(chunk).lower()

I then use codecs to convert the encoding of the file:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    chunk_size = 16 * 1024

    with codecs.open(input_filename, "rb", source_encoding) as source:
        with codecs.open(output_filename, "wb", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)

                if not chunk:
                    break;

                destination.write(chunk)

Finally, I'm attempting to rewrite the XML header. If the XML header was originally

<?xml version="1.0"?>

or

<?xml version="1.0" encoding="windows-1255"?>

I'd like to transform it to

<?xml version="1.0" encoding="UTF-8"?>

My current code doesn't seem to work:

def edit_header(self, input_filename):
    output_filename = tempfile.mktemp(suffix=".xml")

    with open(input_filename, "rb") as source:
        parser = etree.XMLParser(encoding="UTF-8")
        tree = etree.parse(source, parser)

        with open(output_filename, "wb") as destination:
            tree.write(destination, encoding="UTF-8")

The file I'm currently testing has a header that doesn't specify the encoding. How can I make it output the header properly with the encoding specified?

Ria answered 12/9, 2014 at 0:24 Comment(3)
Why do you need that encoding specification in the output?Garlaand
I like to be... verbose about these things.Ria
@VasiliyFaronov if you don't specify the encoding of your output, how would the next person who needs to ingest that data know what encoding it is?Eboat
P
17

Try:

tree.write(destination, xml_declaration=True, encoding='UTF-8')

From the API docs:

xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).

Sample from ipython:

In [15]:  etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>

On reflection, I think you trying way too hard. lxml automatically detects the encoding and correctly parses the file according to that encoding.

So all you really have to do (at least in Python2.7) is:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    tree = etree.parse(input_filename)
    with open(output_filename, 'w') as destination:
        tree.write(destination, encoding='utf-8', xml_declaration=True)
Pitching answered 12/9, 2014 at 3:18 Comment(4)
How good at encoding guessing/transformation is it? I wanted to be sure that it'd transform things properly.Ria
I have files with a header of <?xml version="1.0"?> without showing the encoding which is windows-1255, so I'm not sure how well it would work.Ria
It seems to have worked well, but it has caused other problems.Ria
As OP is using lxml, not the built-in etree lib, the linked docs website is incorrect. But as etree and lxml are quite alike, it works anyway. Here the correct linkCrowe

© 2022 - 2024 — McMap. All rights reserved.