I am trying to create a program in Python that can find a specific word in a .docx file and return the page number it occurs on. So far, looking through the python-docx documentation, I have been unable to find how to access the page number or even the footer where the number would be located. Is there a way to do this using python-docx or even just Python? If not, what would be the best way to do this?
The short answer is no, because page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page the last time the document was rendered.
I don't know which clients do this (although I expect Word itself does) or how reliable it is, but that's the direction I would recommend if you want to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath or something similar to iterate through to a specific page, but thinking it through a bit, it's going to take some detailed development to get that working.
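As a rough illustration of the lastRenderedPageBreak idea, here is a minimal sketch that uses only the standard library (zipfile and xml.etree.ElementTree) instead of python-docx. The function names pages_containing and pages_in_docx are hypothetical, and the sketch assumes the file was last saved by a client that writes w:lastRenderedPageBreak markers; if there are none, every match is reported on page 1.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def pages_containing(document_xml, word):
    """Return sorted page numbers whose text contains `word` (whole word,
    case-insensitive), judged only by <w:lastRenderedPageBreak/> markers."""
    root = ET.fromstring(document_xml)
    body = root.find(W + "body")
    page = 1
    text_by_page = {}
    for el in body.iter():  # document order
        if el.tag == W + "p":
            # Paragraph boundary: keep adjacent paragraphs from running together
            text_by_page.setdefault(page, []).append("\n")
        elif el.tag == W + "lastRenderedPageBreak":
            page += 1  # the last renderer started a new page here
        elif el.tag == W + "t" and el.text:
            text_by_page.setdefault(page, []).append(el.text)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sorted(p for p, parts in text_by_page.items()
                  if pattern.search("".join(parts)))


def pages_in_docx(path, word):
    """Read word/document.xml out of a .docx (a zip archive) and search it."""
    with zipfile.ZipFile(path) as archive:
        return pages_containing(archive.read("word/document.xml"), word)
```

Calling pages_in_docx('report.docx', 'budget') (a made-up file name) would return a list of page numbers. The result is only as trustworthy as the markers left by whatever application last rendered the file, and headers, footers and footnotes live in other parts of the archive that this sketch does not read.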
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
Using Python-docx: identify a page break in paragraph
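import re

from docx import Document

fn = '1.docx'  # python-docx reads .docx files, not the legacy .doc format
document = Document(fn)
pn = 1  # current page number, counting explicit page breaks only

for p in document.paragraphs:
    # Report each chapter heading together with the page it starts on
    r = re.match(r'Chapter \d+', p.text)
    if r:
        print(r.group(), pn)
    for run in p.runs:
        # An explicit page break is stored as <w:br w:type="page"/> inside a run
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn += 1
            print('!!', '=' * 50, pn)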
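I just found an easy way to find the number of pages: convert the document to PDF and count the pages of the PDF.
from pathlib import Path

from docx2pdf import convert  # needs Microsoft Word installed to do the conversion
from PyPDF2 import PdfReader

docx_path = Path('Temp') / 'document.docx'
pdf_path = docx_path.with_suffix('.pdf')

# convert() writes the PDF to pdf_path; its return value is not needed
convert(str(docx_path), str(pdf_path))

reader = PdfReader(str(pdf_path))
num_pages = len(reader.pages)
print(num_pages)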
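Once the document is a PDF, the same idea can be pushed one step further to answer the original question of which page a word falls on. The following is a minimal sketch rather than a tested recipe: find_word_pages and pdf_page_texts are hypothetical helper names, and it assumes PyPDF2 2.0 or later (for PdfReader and extract_text) and that text extraction returns the words intact, which is not guaranteed for every PDF.

```python
import re


def find_word_pages(page_texts, word):
    """Return 1-based page numbers whose text contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [number for number, text in enumerate(page_texts, start=1)
            if pattern.search(text or "")]


def pdf_page_texts(pdf_path):
    """Extract the text of each page with PyPDF2 (imported lazily)."""
    from PyPDF2 import PdfReader  # third-party; assumed to be installed
    return [page.extract_text() for page in PdfReader(pdf_path).pages]


# In-memory text standing in for pdf_page_texts('Temp/document.pdf')
sample_pages = ["Introduction and scope", "The method in detail", "Scope of future work"]
print(find_word_pages(sample_pages, "scope"))  # [1, 3]
```

Keeping the search separate from the PDF reading means the matching logic can be checked without Word or a PDF library installed.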
Modifying the best answer I found for adding page numbers, I was able to add a page count.
The key is to use the NumPages field that is built into MS Word:
from docx import Document
from docx.oxml import OxmlElement, ns


def create_element(name):
    return OxmlElement(name)


def create_attribute(element, name, value):
    element.set(ns.qn(name), value)


def add_page_count(run):
    # A field is written as: begin marker, instruction text, end marker
    fldChar1 = create_element('w:fldChar')
    create_attribute(fldChar1, 'w:fldCharType', 'begin')

    instrText = create_element('w:instrText')
    create_attribute(instrText, 'xml:space', 'preserve')
    instrText.text = "NumPages"

    fldChar2 = create_element('w:fldChar')
    create_attribute(fldChar2, 'w:fldCharType', 'end')

    run._r.append(fldChar1)
    run._r.append(instrText)
    run._r.append(fldChar2)


doc = Document()
# Put the page-count field in a new run in the first section's footer
add_page_count(doc.sections[0].footer.paragraphs[0].add_run())
doc.save("your_doc.docx")