Apache POI: Extract a paragraph and the table that follows from word document (docx) in java

I have a bunch of Word documents (docx) that each give a test case name as a paragraph title and the test steps in the table that follows, along with some other information.

I need to extract the test case name (from the paragraph) and the test steps (from the table) using Apache POI.

An example document is laid out as follows:

Section 1: Index
Section 2: Some description
    A. Paragraph 1
    B. Table 1
    C. Paragraph 2
    D. Paragraph 3
    E. Table 2
Section 3: test cases ( The title "test cases" is constant, so I can look for it in the doc)
    A. Paragraph 4 (First test case)
    B. Table 3 (Test steps table immediately after the para 4)
    C. Paragraph 5 (Second test case)
    D. Table 4 (Test steps table immediately after the para 5)

Apache POI provides APIs that return the list of paragraphs and the list of tables, but I am not able to read a paragraph (test case) and then find the table that immediately follows it.

I tried XWPFWordExtractor (to read all the text) and bodyElementIterator (to iterate over all the body elements), but what I end up using is getParagraphs(), which gives a list of paragraphs [para1, para2, para3, para4, para5], and getTables(), which gives all the tables in the document as a list [table1, table2, table3, table4].

How do I go over all paragraphs, stop at the paragraph that comes after the heading 'test cases' (paragraph 4), and then find the table that immediately follows it (table 3)? I then need to repeat this for paragraph 5 and table 4.

Here is the gist link (code) I tried; it gives a list of paragraphs and a list of tables, but not in a sequence that I can track.

Any help is much appreciated.

Shagbark answered 2/6, 2016 at 17:57 Comment(0)

The Word API in POI is still in flux, and buggy, but you should be able to iterate over the paragraphs in one of two ways:

// fis is an InputStream opened on the .docx file
XWPFDocument doc = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = doc.getParagraphs();
for (XWPFParagraph p : paragraphs) {
   // ... do something with p here
}

or

XWPFDocument doc = new XWPFDocument(fis);
Iterator<XWPFParagraph> iter = doc.getParagraphsIterator();
while (iter.hasNext()) {
   XWPFParagraph p = iter.next();
   // ... do something with p here
}

The Javadocs say that XWPFDocument.getParagraphs() retrieves the paragraphs that hold the text in the header or footer, but I have to believe that this is a copy-and-paste error, as XWPFHeaderFooter.getParagraphs() says the same thing. Looking at the source, XWPFDocument.getParagraphs() returns an unmodifiable list, while the iterator leaves the paragraphs modifiable. This is likely to change in the future, but it is the way it works for now.

To retrieve all body elements, paragraphs and tables alike, in the order they appear in the document, you need to use:

XWPFDocument doc = new XWPFDocument(fis);
Iterator<IBodyElement> iter = doc.getBodyElementsIterator();
while (iter.hasNext()) {
   IBodyElement elem = iter.next();
   if (elem instanceof XWPFParagraph) {
      // paragraph: cast to XWPFParagraph and do something here
   } else if (elem instanceof XWPFTable) {
      // table: cast to XWPFTable and do something here
   }
}

This should allow you to loop through all body elements in order.
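To pair each test case paragraph with the table that follows it, one option is a small state machine over that ordered sequence: skip everything until the 'test cases' heading, remember the most recent paragraph, and attach the next table to it. The sketch below shows only that pairing logic on plain Java types so it runs without POI on the classpath. Para, Table, BodyElement and extract are hypothetical stand-ins, not POI classes; in real code the elements would come from getBodyElementsIterator(), and the heading is assumed to be a paragraph whose text is exactly "test cases".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TestCaseExtractor {

    // Hypothetical stand-ins for POI's IBodyElement, XWPFParagraph and XWPFTable.
    public interface BodyElement {}

    public static final class Para implements BodyElement {
        final String text;
        public Para(String text) { this.text = text; }
    }

    public static final class Table implements BodyElement {
        final List<String> steps;
        public Table(String... steps) { this.steps = Arrays.asList(steps); }
    }

    /**
     * Walks the body in document order and maps each paragraph that appears
     * after the heading to the table that directly follows it.
     */
    public static Map<String, List<String>> extract(List<BodyElement> body, String heading) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        boolean inSection = false;  // becomes true once the heading has been seen
        String lastPara = null;     // latest non-empty paragraph inside the section
        for (BodyElement elem : body) {
            if (elem instanceof Para) {
                String text = ((Para) elem).text.trim();
                if (!inSection) {
                    // Assumption: the heading paragraph matches exactly (ignoring case).
                    inSection = text.equalsIgnoreCase(heading);
                } else if (!text.isEmpty()) {
                    lastPara = text;
                }
            } else if (elem instanceof Table && inSection && lastPara != null) {
                result.put(lastPara, ((Table) elem).steps);
                lastPara = null;    // each paragraph claims at most one table
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<BodyElement> body = new ArrayList<>();
        body.add(new Para("Some description"));
        body.add(new Table("unrelated row"));   // before the heading: ignored
        body.add(new Para("test cases"));
        body.add(new Para("Login test"));
        body.add(new Table("open page", "enter password"));
        body.add(new Para("Logout test"));
        body.add(new Table("click logout"));
        System.out.println(extract(body, "test cases"));
    }
}
```

With POI, the same loop would read the paragraph text and the table rows from the cast XWPFParagraph and XWPFTable objects instead of the stand-in fields; the flag and the remembered paragraph stay the same.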

Kaolack answered 3/6, 2016 at 16:5 Comment(5)
Thanks for the comments. My main concern is that the paragraph list gives the paragraphs and the table list gives the tables, but how do I keep track of the sequence in which they appear? My requirement is to extract the contents of the table that immediately follows a specific paragraph. Somehow I have to keep reading paragraphs and, when my required paragraph comes up, stop and start reading tables from that point on. - Shagbark
@Shagbark did you figure this out? I have the same problem. Perhaps post it as your own answer if you have a solution. - Winkler
I did find the solution. I apologize for not posting earlier. I will post the answer in a couple of days; I am traveling and unfortunately do not have access to the source code. - Shagbark
@SebastianZeki - the answer is in the edit above that was made June 5th. Note that there is a bodyElements list in the XWPFDocument which contains all the paragraphs and tables in order. - Kaolack
@Shagbark I'd still like to see your solution. - Rubyeruch

The only solution I can come up with is to use the word extractor, compare the paragraph content it returns with the paragraphs from XWPFDocument getParagraphArray, and then locate the table by comparing the extractor's content with getTables().

Eyecatching answered 26/11, 2021 at 12:21 Comment(0)