Getting the lines of each paragraph of a docx with Apache POI

I am using the Apache POI library in my app, specifically POIshadow-all (ver. 3.17), to read a Word document. I am successfully extracting every paragraph, as follows:

[image: the document text split by paragraph]

What I actually need is to extract every line, as follows:

[image: the same text split by line]

The code to extract every paragraph is this:

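    try {
        // Open the .docx file and wrap it in an XWPFDocument
        val fis = FileInputStream(path.path + "/" + document)
        val xdoc = XWPFDocument(OPCPackage.open(fis))

        // All body paragraphs of the document, in order
        val paragraphList: MutableList<XWPFParagraph> = xdoc.paragraphs

        private val newParagraph = paragraph.createRun()

        ...

        for (par in paragraphList) {
            // par.text is the full text of the paragraph, with no rendered line information
            val currentParagraph = par.text
            Log.i("TAG", "current: $currentParagraph")

            ...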

The variable currentParagraph holds a whole paragraph, as expected. However, I need a variable, say currentLine, that holds a single line.

I've researched this issue on Stack Overflow and other sites. I found some proposals, but none of them works for me. I also tried getting the data through the CTR objects and using XWPFRun, without any success.

I would be grateful for any recommendation on how to proceed.

Thanks in advance for your help.

Wingfield answered 13/9, 2020 at 21:39 Comment(6)
Is the line break after "Would" the result of the default text flow? If so, it is a result of text rendering, and Apache POI cannot detect it because it does not render the text. Or is there an explicit line break set after "Would"? If so, detecting that line break would be possible. - Damalis
No, there is no specific line break :( - Wingfield
Then it is not possible. Apache POI does not render the page, so it cannot know that the text line is full after the word "Would" and that the default text flow puts the word "you" on a new line. - Damalis
And thinking about the possibility that there was a specific line break, I have thought about the following: is it possible to read hidden characters such as ¶ with Apache POI? howtogeek.com/363893/… - Wingfield
After reading the Word documentation: ¶ only applies to the end of a paragraph, not to every line. - Wingfield
@Wingfield did you try setting newParagraph.setWordWrap(false) for the paragraph? - Feder
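
For the case raised in the comments where the author did insert explicit line breaks (Shift+Enter in Word), a minimal sketch could simply split the paragraph text on newline characters. This assumes that the paragraph text returned by POI contains a '\n' for each explicit break, which is worth verifying against your POI version; the helper name splitOnExplicitBreaks is hypothetical. It does nothing for lines produced by ordinary text flow.

```kotlin
// Hypothetical helper, not part of Apache POI.
// Assumption: paragraphText contains '\n' wherever the author inserted an explicit break.
// Lines created by automatic text flow are NOT detected, because they are never stored.
fun splitOnExplicitBreaks(paragraphText: String): List<String> =
    paragraphText.split('\n')
```

A paragraph without any explicit break comes back as a single-element list, which matches the behaviour described in the question.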

A document file does not store how many lines a given paragraph has, because that depends on how the document is rendered or viewed. Think of a Word document: with a larger font size a paragraph takes up more lines, and with a smaller font size it takes up fewer. The number of lines in a paragraph is therefore not a fixed property of the document; it varies with the rendering.

However, if your application has a hard requirement for an estimate, you can program logic such as “start a new line after X (a constant) characters, rounded to the end of the word”. The right value of X would still change with screen size, font size, zoom level and so on. My suggestion is therefore to design your application so that it does not explicitly measure the number of lines in a paragraph, but instead counts words or characters and uses that as the yardstick for inserting a line break where absolutely necessary.
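
The “new line after X characters” idea could be sketched roughly as follows. This is only an approximation under stated assumptions: wrapToLines and maxChars are hypothetical names, the limit is a constant you would have to pick yourself, and this variant breaks before the word that would cross the limit rather than after it. It will not reproduce the lines Word actually displays.

```kotlin
// Hypothetical helper: greedy word wrap that approximates "lines" of a paragraph.
// maxChars is an assumed constant; real line breaks depend on font, page width and zoom.
fun wrapToLines(paragraph: String, maxChars: Int): List<String> {
    require(maxChars > 0) { "maxChars must be positive" }
    val lines = mutableListOf<String>()
    val current = StringBuilder()
    // Split on runs of whitespace and drop empty tokens.
    val words = paragraph.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
    for (word in words) {
        // Start a new line when adding the next word would exceed the limit.
        if (current.isNotEmpty() && current.length + 1 + word.length > maxChars) {
            lines.add(current.toString())
            current.setLength(0)
        }
        if (current.isNotEmpty()) current.append(' ')
        current.append(word) // a word longer than maxChars gets a line of its own
    }
    if (current.isNotEmpty()) lines.add(current.toString())
    return lines
}
```

Inside the loop from the question, each element of wrapToLines(par.text, 60) could then play the role of currentLine, where 60 is an arbitrary example value.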

Another potential approach is to split the paragraph into sentences at terminal punctuation, e.g. “start a new sentence after each ‘?’, ‘!’ or ‘.’ character within a paragraph.” This too can get rather tricky, depending on the structure of certain sentences, since abbreviations and decimal numbers also contain ‘.’.
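
A naive version of the sentence-splitting idea might look like this. The helper name splitIntoSentences is hypothetical, and the rule used here (split at whitespace that follows ‘.’, ‘!’ or ‘?’) is an assumption chosen for brevity; it mis-splits abbreviations such as “e.g.”, which is exactly the trickiness mentioned above.

```kotlin
// Hypothetical helper: splits a paragraph at whitespace that follows '.', '!' or '?'.
// Naive rule: abbreviations such as "e.g." or "Dr." are split incorrectly.
fun splitIntoSentences(paragraph: String): List<String> =
    paragraph.trim()
        .split(Regex("(?<=[.!?])\\s+")) // lookbehind keeps the punctuation with its sentence
        .filter { it.isNotEmpty() }
```

The lookbehind keeps each punctuation mark attached to the sentence it ends, so joining the results with spaces gives back the original text (up to whitespace).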

Therefore, the answer to your question is that there is no “out of the box” way to detect the number of lines in a given paragraph using Apache POI. If it is absolutely necessary, you would have to program your own logic, perhaps using one of the approaches outlined above.

Hiramhirasuna answered 22/9, 2020 at 0:44 Comment(0)
