How to access OpenXML content by page number?

Asked 12/10, 2016 at 7:29 Answered 3/11, 2021 at 9:40

Using OpenXML, can I read the document content by page number?

wordDocument.MainDocumentPart.Document.Body gives the content of the full document.

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            // Stored page count; the properties part or the value may be missing.
            int pageCount = 0;
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }
            for (int i = 1; i <= pageCount; i++)
            {
                // Read the content by page number
            }
        }
    }

MSDN Reference

Update 1:

It looks like page breaks are represented as below:

    <w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
        <w:r>
            <w:br w:type="page" />
        </w:r>
    </w:p>

So now I need to split the XML on the above check and take the InnerText of each part, which will give me the page-wise text.

Now the question becomes: how can I split the XML on that check?

Update 2:

Page break elements are present only where explicit page breaks were inserted. When text simply flows from one page onto the next, no page break element is written to the XML, so I am back to the same challenge: how to identify the page separations.

Caldarium answered 12/10, 2016 at 7:29 Comment(2)

Have a read of this #14480198 – Giorgione 12/10, 2016 at 7:55

@PaulZahra I don't find such element (lastRenderedPageBreak) in XML – Caldarium 12/10, 2016 at 8:42

This is how I ended up doing it.

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Text of each page, keyed by 1-based page number.
        Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
        int pageCount = 0;
        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            // Stored page count; the properties part or the value may be missing.
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }
            int i = 1;
            StringBuilder pageContentBuilder = new StringBuilder();
            foreach (var element in body.ChildElements)
            {
                // A hard page break is a w:br with w:type="page" (requires System.Linq).
                bool hasPageBreak = element.Descendants<Break>()
                    .Any(b => b.Type != null && b.Type.Value == BreakValues.Page);
                if (!hasPageBreak)
                {
                    pageContentBuilder.Append(element.InnerText);
                }
                else
                {
                    // Close the current page. Text inside the element that
                    // carries the break itself is not collected.
                    pageviseContent.Add(i, pageContentBuilder.ToString());
                    i++;
                    pageContentBuilder = new StringBuilder();
                }
            }
            // Flush whatever follows the last page break.
            if (pageContentBuilder.Length > 0)
            {
                pageviseContent.Add(i, pageContentBuilder.ToString());
            }
        }
    }

Downside: This won't work in all scenarios. It works only where the document has explicit page breaks. If text simply runs on from page 1 to page 2, nothing in the XML tells you that you are now on page two.

Caldarium answered 13/10, 2016 at 7:17 Comment(2)

Thanks for the answer! How to copy the page-wise content into a new document? – Sepaloid 6/10, 2018 at 10:1

Is there an alternative to that for pages without page breaks? – Droshky 17/9, 2022 at 21:32

You cannot reference OOXML content via page numbering at the OOXML data level alone.

Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated according to line break and pagination algorithms which are implementation dependent; they are not intrinsic to the OOXML data. There is nothing to count.
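
As a rough illustration of the "hard page breaks can be counted" point, here is a minimal sketch using the Open XML SDK. The class and method names (HardPageBreaks, Count, CountInFile) are made up for this example and are not part of the SDK. It counts only explicit w:br elements with w:type="page"; other markup that forces a new page (for example paragraph-level "page break before" or section breaks) is not handled, and soft page breaks cannot be found this way at all.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, named for illustration only.
public static class HardPageBreaks
{
    // Counts explicit <w:br w:type="page"/> elements anywhere in the body.
    public static int Count(Body body)
    {
        return body.Descendants<Break>()
            .Count(b => b.Type != null && b.Type.Value == BreakValues.Page);
    }

    // Opens the file read-only and counts its hard page breaks.
    // Returns 0 when the package has no main document body.
    public static int CountInFile(string docxPath)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            Body body = doc.MainDocumentPart?.Document?.Body;
            return body == null ? 0 : Count(body);
        }
    }
}
```

The result is a lower bound on the number of page transitions, not a page count: a document with zero hard breaks can still span many pages once a layout engine paginates it.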

What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:

By definition, the w:lastRenderedPageBreak position is stale when content has been changed since the document was last opened by a program that paginates its content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances.

If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.

Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.

Hyperextension answered 19/10, 2016 at 19:18 Comment(12)

Thanks for the details. That is what I figured out with my research as well. But can I use Word Automation from a web based interface, I mean I have the word document as a binary in my database and use this to get the page wise content using licensed Word Automation? – Caldarium 19/10, 2016 at 19:32

How about using Add-In Express add-in-express.com/creating-addins-blog/2013/08/07/… – Caldarium 19/10, 2016 at 19:46

I don't recommend using Word Automation on the server because the inherent licensing and server operation limitations stated by Microsoft, but if it works for your situation, great. – Hyperextension 19/10, 2016 at 19:56

The techniques discussed in the Add-in Express post you cite require Word Automation. – Hyperextension 19/10, 2016 at 20:2

Yes, can I use Add-In Express or just Word Automation and pass docx binary to get the page wise content? – Caldarium 19/10, 2016 at 20:14

You can, modulo the licensing and server operation concerns I keep mentioning. Note also that DOCX files aren't usually described as binary; they're OOXML files packaged via OPC. Usually, it's DOC files (pre-2007 format) that are referred to as binary. – Hyperextension 19/10, 2016 at 20:20

This answer confirms I cannot do what I need, but doesn't give me a solution. Should I still give the bounty to this answer? – Caldarium 21/10, 2016 at 8:12

I hope you found this canonical answer to have been helpful enough to be worthy of the bounty, but, regardless, if I've conveyed a deeper understanding of the solution space to you or to future readers, I'll have accomplished my goal. – Hyperextension 21/10, 2016 at 12:38

fair enough, you get the bounty :) – Caldarium 21/10, 2016 at 17:32

Is it still true now that the lastRenderedPageBreak elements are not reliable for determining page numbers? The links about specific cases all give 404s now. – Thionate 16/7, 2024 at 8:17

@PeterDongan: The links are dead because Microsoft killed MSDN TechNet forums, not because lastRenderedPageBreak suddenly became reliable. If you want to search The Wayback Machine, you can probably find the articles again. For example, here's one. Feel free to update the links here if you do; I do not have time to do so myself. Thanks. – Hyperextension 16/7, 2024 at 13:41

Thanks. I experimented since posting that comment and found that they're still unreliable alright. I wasn't inferring that they were now reliable from the links being dead. I was just wondering if things had changed since this answer was posted eight years ago. – Thionate 16/7, 2024 at 15:9

Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, a docx file does not contain reliable page number information. The XML files carry no page numbers; those exist only once Microsoft Word opens the document and renders it dynamically. Reading the Open XML documentation, for example https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 , leads to the same conclusion.

You can unzip some docx files and search for "page" or "pg" to see this for yourself. I did this on different kinds of docx files in my situation, and they all told me the same thing. Glad if this helps.
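
The "unzip and search" check can be scripted. The sketch below is one possible way to do it with only the .NET base class library; the names DocxPageSearch, Tally and TallyInFile are invented for this example. It assumes the main document part lives at word/document.xml, which is the usual convention but not guaranteed. It tallies elements whose name, or any attribute value, contains "page" or "pg".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class DocxPageSearch
{
    private static bool Mentions(string s)
    {
        return s.IndexOf("page", StringComparison.OrdinalIgnoreCase) >= 0
            || s.IndexOf("pg", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Tallies, by local element name, every element whose name or
    // attribute values mention "page" or "pg".
    public static SortedDictionary<string, int> Tally(XDocument xml)
    {
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);
        foreach (XElement e in xml.Descendants())
        {
            bool hit = Mentions(e.Name.LocalName)
                || e.Attributes().Any(a => !a.IsNamespaceDeclaration && Mentions(a.Value));
            if (!hit) continue;
            counts.TryGetValue(e.Name.LocalName, out int n); // n stays 0 for a new key
            counts[e.Name.LocalName] = n + 1;
        }
        return counts;
    }

    // Assumption: the main part is stored as word/document.xml.
    public static SortedDictionary<string, int> TallyInFile(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null) throw new InvalidDataException("word/document.xml not found");
            using (Stream s = entry.Open())
            {
                return Tally(XDocument.Load(s));
            }
        }
    }
}
```

On typical files the hits are things like page size and margin settings, explicit page breaks, and lastRenderedPageBreak hints. None of them assigns a page number to a piece of content, which is the point being made above.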

Stress answered 3/11, 2021 at 9:40 Comment(1)

Downvote explanation: Searching for "page" or "pg" may work for your documents but is definitely not a solution to identifying page breaks in all DOCX documents. – Hyperextension 2/4, 2023 at 14:37

-1

// wp is an open WordprocessingDocument; requires System.Linq.
List<Paragraph> allParagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();

// Paragraphs that contain exactly one lastRenderedPageBreak marker.
List<Paragraph> pageParagraphs = allParagraphs.Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1).ToList();

Descombes answered 12/2, 2018 at 13:16 Comment(1)

Add an explanation of the code and how it solves the issue – Androw 12/2, 2018 at 13:35

-2

Rename the docx file to zip and open the docProps\app.xml file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <Template>Normal</Template>
  <TotalTime>0</TotalTime>
  <Pages>1</Pages>
  <Words>141</Words>
  <Characters>809</Characters>
  <Application>Microsoft Office Word</Application>
  <DocSecurity>0</DocSecurity>
  <Lines>6</Lines>
  <Paragraphs>1</Paragraphs>
  <ScaleCrop>false</ScaleCrop>
  <HeadingPairs>
    <vt:vector size="2" baseType="variant">
      <vt:variant>
        <vt:lpstr>Название</vt:lpstr>
      </vt:variant>
      <vt:variant>
        <vt:i4>1</vt:i4>
      </vt:variant>
    </vt:vector>
  </HeadingPairs>
  <TitlesOfParts>
    <vt:vector size="1" baseType="lpstr">
      <vt:lpstr/>
    </vt:vector>
  </TitlesOfParts>
  <Company/>
  <LinksUpToDate>false</LinksUpToDate>
  <CharactersWithSpaces>949</CharactersWithSpaces>
  <SharedDoc>false</SharedDoc>
  <HyperlinksChanged>false</HyperlinksChanged>
  <AppVersion>14.0000</AppVersion>
</Properties>

The OpenXML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. These properties are written only by the WinWord application. If the Word document has been changed by anything else, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer accurate. If the Word document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.
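
Given that the part or the value can be missing, a defensive read might look like the sketch below. DocxPageCount, ParsePages and ReadStoredPageCount are names made up for this example; only WordprocessingDocument.Open and the ExtendedFilePropertiesPart.Properties.Pages.Text chain come from the code above. A null result means "no stored value", and a non-null result is still only what the last writing application recorded.

```csharp
using System.Globalization;
using DocumentFormat.OpenXml.Packaging;

// Hypothetical helper, named for illustration only.
public static class DocxPageCount
{
    // Parses the text of <Pages>; returns null when it is missing or not an integer.
    public static int? ParsePages(string pagesText)
    {
        return int.TryParse(pagesText, NumberStyles.Integer,
                            CultureInfo.InvariantCulture, out int pages)
            ? pages
            : (int?)null;
    }

    // Reads the stored page count without throwing when docProps/app.xml,
    // its Properties element, or the Pages element is absent.
    public static int? ReadStoredPageCount(string docxPath)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            return ParsePages(doc.ExtendedFilePropertiesPart?.Properties?.Pages?.Text);
        }
    }
}
```

Callers can then treat null as "unknown" instead of silently falling back to 0 pages.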

Zhdanov answered 24/6, 2019 at 16:45 Comment(0)
