How do I view the XML produced by the python-docx package?

from docx import Document
import bs4


def add_parsed_html_to_paragraph(p, s):
    # Name the parser explicitly: "lxml" wraps bare text in a <p> element,
    # which the soup.find('p') call below relies on.
    soup = bs4.BeautifulSoup(s, 'lxml')
    para = soup.find('p')
    for e in para.children:
        if isinstance(e, bs4.element.NavigableString):
            # Plain text between tags becomes an unformatted run.
            r = p.add_run(str(e))
        else:
            r = p.add_run(e.text)
            if e.name == 'sub':
                r.font.subscript = True
            elif e.name == 'sup':
                r.font.superscript = True


title = 'A formula: H<sub>2</sub>O.'
document = Document()
p = document.add_paragraph()
add_parsed_html_to_paragraph(p, title)
# ... Now I want to check p or document for the correct XML

Each so-called oxml element object in python-docx has an .xml property for precisely this use case. It's used for the internal unit tests.

All you need is the name of the internal variable that holds the XML element. You can generally find it by clicking the [source] link next to that object in the docs, like here: https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects

Clicking through that link, you can find that for a paragraph, the underlying XML element is available on ._p. Usually the variable is named after the element's tag name without the namespace prefix, although sometimes it's the generic ._element. The latter is a good one to try in a pinch if you need to guess.

So using it is as simple as:

>>> paragraph._p.xml
<w:p>
  <w:pPr>
    <w:jc w:val="right"/>
  </w:pPr>
  <w:r>
    <w:t>Right-aligned</w:t>
  </w:r>
</w:p>

There is a companion domain-specific language (DSL) in the unit-test utilities called CXML (compact XML) which allows you to take care of namespacing, which is otherwise a big pain. It looks something like this:

expected_xml = cxml.xml('w:p(w:pPr/w:jc{w:val=right},w:r/w:t"Right-aligned")')

You can see examples throughout the unit tests like here: https://github.com/python-openxml/python-docx/blob/master/tests/text/test_paragraph.py#L113 and ask more specific questions here with the "python-docx" tag if you need help.
