The cause is simply that "Total For Line Extended Price" is at a y coordinate of 507.37, while "Part Description Quantity Unit Price" is at a y coordinate of 506.42.
The LocationTextExtractionStrategy allows for small variations by only considering the integer part of the y coordinates, but even the integer parts differ here. Thus, it assumes the former headings to be on a line above the latter ones and outputs its results accordingly.
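To illustrate why the integer-part comparison does not help for these two coordinates, here is a minimal sketch. It assumes the comparison amounts to truncating each y coordinate to an int, as described above; the class and method names are hypothetical and not part of iText.

```java
class LineBucketDemo {
    // Hypothetical helper (not an iText method): treats two chunks as being on
    // the same line if the integer parts of their y coordinates are equal.
    static boolean sameLineByIntegerPart(float y1, float y2) {
        return (int) y1 == (int) y2;
    }

    public static void main(String[] args) {
        float totalHeadingsY = 507.37f; // "Total For Line Extended Price"
        float partHeadingsY = 506.42f;  // "Part Description Quantity Unit Price"

        // 507 vs. 506: the headings end up on different lines.
        System.out.println(sameLineByIntegerPart(totalHeadingsY, partHeadingsY)); // prints false

        // A smaller variation within the same integer would have been tolerated.
        System.out.println(sameLineByIntegerPart(506.42f, 506.98f)); // prints true
    }
}
```

So a difference of less than one unit is enough to split a visual line as soon as it crosses an integer boundary.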
In case of such variations, a usual first attempt is to try the SimpleTextExtractionStrategy. Unfortunately, this does not help here, as the former text is actually drawn before the latter text. Thus, this strategy also returns the headings in the wrong order.
In such a situation you need a strategy that works differently, e.g. HorizontalTextExtractionStrategy or HorizontalTextExtractionStrategy2 from this answer (depending on your iText version: the former for versions up to iText 5.5.8, the latter for the current development code 5.5.9-SNAPSHOT). Using it, you'll get
Part Description Quantity Unit Price Total For Line Extended Price
Landing Fee 1.00 407.84 $ USD 407.84 407.84 $
Parking 1.00 101.96$ USD 101.96 101.96$
??? 1.00 51.65$ USD 51.65 51.65$
Pax Baggage Handling Fee 5.00 8.49$ USD 42.45 42.45 $
Pax Airport Tax 5.00 26.36 $ USD 131.80 131.80$
GA terminal for crew on Arr ferry fit 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Pax on Dep. 5.00 124.00$ USD 620.00 620.00 $
GA terminal for crew on dep. 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Guest on Dep. 1.00 38.00$ USD 38.00 38.00 $
Crew transfer on arr 1.00 70.00 $ USD 70.00 70.00 $
Crew transfer on dep 1.00 70.00 $ USD 70.00 70.00 $
Lavatory Service 1.00 75.00 $ USD 75.00 75.00 $
Catering-ISS 1.00 1,324.28 $ USD 1,324.28 1,324.28 $
Ground Handling 1.00 190.00$ USD 190.00 190.00$
Pax Handling 1.00 190.00$ USD 190.00 190.00$
Push Back 1.00 83.00 $ USD 83.00 83.00 $
Towing 1.00 110.00$ USD 110.00 110.00$
(result of using the TextExtraction test method testLocation_text_extraction_test)
Unfortunately, though, these strategies fail if there are overlapping lines in different side-by-side columns, e.g. in your document the invoice recipient address and the information to its right.
You might either try to tweak the horizontal strategies (e.g. by also analyzing horizontal gaps separating columns) or try a combined approach, using the output of multiple strategies for the same document.
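As a rough illustration of the first idea (analyzing horizontal gaps), the following sketch splits the chunks of one extracted line into column segments wherever the gap between neighbouring chunks exceeds a threshold. It is not the code of the horizontal strategies mentioned above: the Chunk class, the splitAtGaps method, the threshold and the sample coordinates are all made up for illustration, and a real implementation would have to take its chunk positions from the iText render information.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnGapDemo {
    // Hypothetical stand-in for a text chunk: its text plus start and end x coordinates.
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // Sorts the chunks of one line by start x and starts a new segment
    // whenever the horizontal gap to the previous chunk exceeds minGap.
    static List<String> splitAtGaps(List<Chunk> line, float minGap) {
        List<Chunk> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble(c -> c.startX));

        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float lastEnd = 0f;
        for (Chunk c : sorted) {
            if (current.length() > 0 && c.startX - lastEnd > minGap) {
                segments.add(current.toString()); // gap found: close the segment
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(c.text);
            lastEnd = c.endX;
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        // Made-up coordinates: an address on the left, other information to its right.
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk("Company", 300f, 340f));
        line.add(new Chunk("Invoice", 50f, 85f));
        line.add(new Chunk("recipient", 88f, 130f));
        line.add(new Chunk("details", 343f, 375f));

        System.out.println(splitAtGaps(line, 20f)); // prints [Invoice recipient, Company details]
    }
}
```

Once a line is split like this, the segments of each column can be collected separately, so that side-by-side blocks are no longer interleaved. Choosing a suitable gap threshold depends on the document and would need some experimentation.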