Is there any way to read a .docx file, including auto-numbering, using python-docx?

Problem statement: Extract sections from a .docx file, including the auto-numbering.

I tried python-docx to extract the text from a .docx file, but it excludes the auto-numbering.

from docx import Document

document = Document("wadali.docx")


def iter_items(paragraphs):
    # str.startswith accepts a tuple of prefixes, so each paragraph is
    # tested once and yielded at most once.
    prefixes = ('Agt', 'TOC', 'Heading', 'Title', 'Table Normal', 'List')
    for paragraph in paragraphs:
        if paragraph.style.name.startswith(prefixes):
            yield paragraph


for item in iter_items(document.paragraphs):
    print(item.text)
Exam answered 30/8, 2018 at 9:59 Comment(9)
Could you provide a minimum working example, so we can reproduce your problem and work on it?Tannenwald
You can't do this. There is no API support and I am not even sure you can extract this from the XML source either.Bandstand
@Tannenwald edited question added my work with docx.Exam
@PearlySpencer is there any other library or source that can help extract text with auto-numbering?Exam
To my knowledge, no. But as I said, you might be able to extract what you need directly from the XML file, depending on the contents of your document.Bandstand
To my knowledge, auto-numbering in docx is stored as a reference to a "Numbering Definition Instance"; you may extract the definition and compute the numbers from it.Byrnes
@Byrnes which one is this? I have not come across it before.Bandstand
section 17.9.16 of ISO/IEC 29500-1:2012(E) @PearlySpencerByrnes
@Byrnes ah yes, i thought you were referring to the API.Bandstand
D
10

It appears that currently python-docx v0.8 does not fully support numbering. You need to do some hacking.

First, for the demo, to iterate the document paragraphs, you need to write your own iterator. Here is something functional:

import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
import docx.table
import docx.text.paragraph


def iter_paragraphs(parent, recursive=True):
    """
    Yield each paragraph within *parent*, in document order, descending
    into the cells of nested tables when *recursive* is true.
    Each returned value is an instance of Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, docx.document.Document):
        parent_elm = parent.element.body
    elif isinstance(parent, docx.table._Cell):
        parent_elm = parent._tc
    else:
        raise TypeError(repr(type(parent)))

    for child in parent_elm.iterchildren():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield docx.text.paragraph.Paragraph(child, parent)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            if recursive:
                table = docx.table.Table(child, parent)
                for row in table.rows:
                    for cell in row.cells:
                        for child_paragraph in iter_paragraphs(cell):
                            yield child_paragraph

You can use it to find all document paragraphs including paragraphs in table cells.

For instance:

import docx

document = docx.Document("sample.docx")
for paragraph in iter_paragraphs(document):
    print(paragraph.text)

To access the numbering property, you need to search in the "protected" members paragraph._p.pPr.numPr, which is a docx.oxml.numbering.CT_NumPr object:

for paragraph in iter_paragraphs(document):
    p_pr = paragraph._p.pPr  # None when the paragraph has no <w:pPr> element
    num_pr = p_pr.numPr if p_pr is not None else None
    if num_pr is not None:
        print(num_pr)  # type: docx.oxml.numbering.CT_NumPr

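Note that this object only refers to a numbering definition (through its numId and ilvl children); the definitions themselves are stored in the numbering.xml file (inside the docx), if it exists.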

To access it, you need to read your docx file like a package. For instance:

import docx.package
import docx.parts.document
import docx.parts.numbering

package = docx.package.Package.open("sample.docx")

main_document_part = package.main_document_part
assert isinstance(main_document_part, docx.parts.document.DocumentPart)

numbering_part = main_document_part.numbering_part
assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)

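ct_numbering = numbering_part._element
print(ct_numbering)  # CT_Numbering
for num in ct_numbering.num_lst:
    print(num)  # CT_Num
    print(num.numId)  # int: the id that a paragraph's numPr.numId.val refers to
    print(num.abstractNumId.val)  # int: .val unwraps the CT_DecimalNumber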
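If you would rather not depend on python-docx internals, the same definitions can be read with the standard library alone. This is only a sketch under a few assumptions: the part is stored under the conventional name word/numbering.xml (the real location is defined by the package relationships), and level overrides (w:lvlOverride) are ignored. The function name read_numbering_formats is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_numbering_formats(docx_file):
    """Map each numId to {ilvl: (numFmt, lvlText)} read from word/numbering.xml.

    Assumption: the numbering part uses the conventional part name;
    w:lvlOverride elements are not taken into account.
    """
    with zipfile.ZipFile(docx_file) as zf:
        if "word/numbering.xml" not in zf.namelist():
            return {}  # the document defines no numbering at all
        root = ET.fromstring(zf.read("word/numbering.xml"))

    def val(element, tag):
        # Return the w:val attribute of a child element, or None if absent.
        child = element.find(tag, NS)
        return None if child is None else child.get(f"{{{W}}}val")

    # First collect the abstract definitions, keyed by abstractNumId.
    abstract = {}
    for abstract_num in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract_num.findall("w:lvl", NS):
            ilvl = int(lvl.get(f"{{{W}}}ilvl"))
            levels[ilvl] = (val(lvl, "w:numFmt"), val(lvl, "w:lvlText"))
        abstract[abstract_num.get(f"{{{W}}}abstractNumId")] = levels

    # Then resolve each numbering instance (w:num) to its abstract definition.
    formats = {}
    for num in root.findall("w:num", NS):
        num_id = int(num.get(f"{{{W}}}numId"))
        formats[num_id] = abstract.get(val(num, "w:abstractNumId"), {})
    return formats
```

Calling read_numbering_formats("sample.docx") returns a dictionary keyed by numId, so the numId and ilvl found on a paragraph can be looked up directly to see which number format and level text apply.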

More information is available in the Office Open XML documentation.
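Going from numPr to a label such as 5.1.4 means keeping counters yourself, since Word computes the numbers at display time. The following is a minimal sketch, not what Word does in full: it assumes every level is decimal and starts at 1, joins levels with dots instead of honouring lvlText, and ignores start values, overrides and numbering inherited from a paragraph style. The helper names get_num_ref and number_labels are invented for this example.

```python
def get_num_ref(paragraph):
    """Return (numId, ilvl) for a python-docx paragraph, or None if unnumbered."""
    p_pr = paragraph._p.pPr
    if p_pr is None or p_pr.numPr is None:
        return None
    num_pr = p_pr.numPr
    # A numId of 0 is used to switch numbering off for a paragraph.
    if num_pr.numId is None or num_pr.numId.val == 0:
        return None
    ilvl = num_pr.ilvl.val if num_pr.ilvl is not None else 0
    return num_pr.numId.val, ilvl


def number_labels(refs):
    """Yield a dotted label for each (numId, ilvl) pair, or None for None.

    Simplification: one list of counters per numId, decimal numbers only.
    """
    counters = {}
    for ref in refs:
        if ref is None:
            yield None
            continue
        num_id, ilvl = ref
        levels = counters.setdefault(num_id, [])
        del levels[ilvl + 1:]                          # reset deeper levels
        levels.extend([0] * (ilvl + 1 - len(levels)))  # open missing levels
        levels[ilvl] += 1
        yield ".".join(str(n) for n in levels)
```

With the iterator defined earlier, `number_labels(get_num_ref(p) for p in iter_paragraphs(document))` yields one label (or None) per paragraph, in document order.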

Deca answered 14/9, 2018 at 16:21 Comment(2)
It's printing to me: <CT_DecimalNumber '<w:abstractNumId>' at 0x10d0eef40> AND <CT_Num '<w:num>' at 0x10d0eecc0> not the decimal value, any ideas?Imposture
num_pr contains two integer fields that can be of interest when trying to get paragraph numbers: num_pr.numId.val and num_pr.ilvl.val (you may want to check that all of the intermediate items are not None before using them though). I attempted to use those to rebuild the full paragraph number the same way it is displayed in Word (e.g. 5.1.4) but haven't quite figured out how to get there yet.Oringa
S
10

There is a package, docx2python, which does this in a much simpler fashion: pypi.org/project/docx2python/

The following code:

from docx2python import docx2python
document = docx2python("C:/input/MyDoc.docx")
print(document.body)

produces a list which contains the contents including bullet lists in a nice parse-able fashion.
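Because that list is nested (tables, rows, cells, then paragraphs), you may want a single flat list of paragraph strings. A small sketch, assuming the nested structure contains plain strings at its leaves; flatten_text is a hypothetical helper, not part of docx2python:

```python
def flatten_text(nested):
    """Recursively collect every non-blank string from nested lists or tuples."""
    flat = []
    for item in nested:
        if isinstance(item, str):
            if item.strip():          # skip empty paragraphs
                flat.append(item)
        elif isinstance(item, (list, tuple)):
            flat.extend(flatten_text(item))
    return flat
```

For example, `flatten_text(document.body)` gives one list of paragraphs with their numbering prefixes intact. If a single string is enough, `document.text` may already be what you need.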

Selfdeceit answered 25/4, 2020 at 12:57 Comment(1)
Thanks @Selfdeceit, it worked for me. The only thing is that if the doc contains images, the text gets divided into different lists, and you have to extend them if you want one single list.Iridium
U
1

Since the numbers are not stored in the DOCX but rather computed in Word on the fly, you need to do the same in Python. For example, for Heading styles:

from docx import Document

documentFn = "sample.docx"  # placeholder path; point this at your own file
document = Document(documentFn)

maxLevel = 7
zeros = [0] * maxLevel
counters = [0] * maxLevel
# Built-in heading styles are named 'Heading 1', 'Heading 2', ...
headingStyles = ['Heading %i' % i for i in range(1, maxLevel + 1)]


def makeNumber(counters):
    return '.'.join([str(i) for i in counters if i])


for para in document.paragraphs:
    if para.style.name in headingStyles and para.text.strip() != '':
        level = int(para.style.name.split(' ')[1]) - 1
        # Increment this level and reset all deeper levels to zero.
        counters = (counters[:level] + [counters[level] + 1] + zeros)[:maxLevel]
        text = makeNumber(counters) + ' ' + para.text
        print('    ' * (level), text)

(initial idea courtesy of retsyo on GitHub)
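The same counting can be wrapped in a generator that accepts any iterable of paragraphs, which makes it easy to feed it something other than document.paragraphs. This is a sketch under the same assumptions as above (headings use the built-in 'Heading N' style names, and skipped levels are simply left out of the number); number_headings is a name chosen for this example.

```python
import re

HEADING_RE = re.compile(r"^Heading (\d+)$")


def number_headings(paragraphs):
    """Yield (number, text) for each non-empty 'Heading N' paragraph."""
    counters = []
    for para in paragraphs:
        match = HEADING_RE.match(para.style.name)
        text = para.text.strip()
        if not match or not text:
            continue
        level = int(match.group(1))  # 1-based heading level
        if level < 1:
            continue
        del counters[level:]                             # reset deeper levels
        counters.extend([0] * (level - len(counters)))   # open missing levels
        counters[level - 1] += 1
        # Levels that were never used stay at 0 and are omitted from the label.
        yield ".".join(str(n) for n in counters if n), text
```

Usage: `for number, text in number_headings(document.paragraphs): print(number, text)`. Any other iterable of paragraph objects works as well, for instance one that also walks table cells.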

Urnfield answered 5/1 at 12:2 Comment(1)
This combined with the iter_paragraphs() from the answer above replacing document.paragraphs did the job for me (some of the titles were inside tables in my document).Oringa
