I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect two columns in key-value format. I checked most of what is available on the net for a solution but could not find one. Any pointers?
Well, I went ahead and implemented it separately using Apache POI for the MS formats, and came back to Tika for PDF. Tika outputs a document as "SAX-based XHTML events".
So basically we can write a custom SAX implementation to parse the file.
The structured text output has the following form (metadata omitted):
<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>
In our SAX implementation we can treat the first part of each line as the key (for my problem I already know the keys and am looking for the values, so the value is simply the substring after the key).
Override public void characters(char[] ch, int start, int length) with this logic.
Please note that in my case the structure of the content is fixed and I know the keys that are coming in, so it was easy to do it this way. This is not a generic solution.
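The idea above can be sketched with a plain SAX handler from the JDK. This is only an illustration under assumptions, not the original implementation: the class name KeyValueHandler and the helper extract are hypothetical, and the sketch assumes every paragraph has the form "Key Value" with the key taken from a known list. It buffers text in characters() and splits in endElement(), because SAX may deliver one paragraph in several characters() calls.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects "Key Value" paragraphs for a fixed list of known keys.
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder paragraph = new StringBuilder();

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    // Matches <p> whether or not the parser reports namespace-aware local names.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            paragraph.setLength(0); // start collecting a new paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        paragraph.append(ch, start, length); // may be called several times per paragraph
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isParagraph(localName, qName)) {
            return;
        }
        String line = paragraph.toString().trim();
        for (String key : knownKeys) {
            // The value is whatever follows the known key and a space.
            if (line.startsWith(key + " ")) {
                values.put(key, line.substring(key.length()).trim());
                break;
            }
        }
        paragraph.setLength(0);
    }

    Map<String, String> getValues() {
        return values;
    }

    // Convenience for trying the handler on an XHTML string without Tika.
    static Map<String, String> extract(String xhtml, List<String> keys) {
        try {
            KeyValueHandler handler = new KeyValueHandler(keys);
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.getValues();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

Since the handler is an ordinary ContentHandler, it should also be usable as the handler argument of a Tika parse call, after which getValues() holds the result. Keys that contain spaces or values that wrap onto a second paragraph are not handled by this sketch.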
Tika doesn't parse table information. In fact, the confusing part is that it converts table tags to <p>, which means we lose the structure. This is the case up to the current version, 1.14. That may be remedied in the future, but so far there are no plans to work in that direction.
You can refer to the JIRA issue that discusses this shortcoming in Tika. After the issue was raised, the wiki was also updated to reflect this inadequacy. [Disclaimer: I raised the JIRA issue.]
Now the solution part: in my experience, Aspose.Pdf for Java does a brilliant job of converting PDF into HTML, but it's licensed. You can check the quality via the free trial version. Code and example links.
I use a combination of Tika (tika-app-1.19.jar) and Aspose (aspose-pdf-18.9.1.jar).
I first modify the PDF using Aspose so that it has pipes ('|') at the end of the table columns, and then read it into Tika and convert it to text.
InputStream is = part.getInputStream(); // input stream of the PDF or PDF part
// Use Aspose to add pipes ("|")
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Document pdfDocument = new Document(is); // load the existing PDF file
PageCollection pageCollection = pdfDocument.getPages();
int iNumPages = pageCollection.size();
for (int i = 1; i <= iNumPages; i++) // Aspose page indexes start at 1
{
    Page page = pageCollection.get_Item(i);
    TableAbsorber absorber = new TableAbsorber(); // create a TableAbsorber object to find tables
    absorber.visit(page); // visit the current page with the absorber
    IGenericList<AbsorbedTable> listTables = absorber.getTableList();
    for (AbsorbedTable absorbedTable : listTables)
    {
        IGenericList<AbsorbedRow> listRows = absorbedTable.getRowList();
        for (AbsorbedRow absorbedRow : listRows)
        {
            IGenericList<AbsorbedCell> listCells = absorbedRow.getCellList();
            for (AbsorbedCell absorbedCell : listCells)
            {
                TextFragmentCollection collectionTextFrag = absorbedCell.getTextFragments();
                Rectangle rectangle = absorbedCell.getRectangle();
                // Add a pipe ("|") at the upper-right corner of the cell to mark where the cell ends
                TextBuilder textBuilder = new TextBuilder(page);
                TextFragment textFragment = new TextFragment("|");
                double x = rectangle.getURX();
                double y = rectangle.getURY();
                textFragment.setPosition(new Position(x, y));
                textBuilder.appendText(textFragment);
            }
        }
    }
}
pdfDocument.save(outputStream);
is = new ByteArrayInputStream(outputStream.toByteArray()); // input stream of the modified PDF with pipes ("|") included
Now the PDF input stream above, with pipes ("|") at the table cell ends, can be pulled into Tika and converted to text:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParser pdfParser = new PDFParser();
PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);
//InputStream stream = new ByteArrayInputStream(sIS.getBytes(StandardCharsets.UTF_8));
pdfParser.parse(is, handler, metadata, context);
String sPdfData = handler.toString();
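Once the text is in a string such as sPdfData, the pipes can be used to split it back into cells. The following is a minimal sketch under loud assumptions: it assumes each table row ends up on a single line of the extracted text with one pipe after every cell (for example "Key1 | Value1 |"), which depends on the PDF layout and on the position sorting behaving as hoped. The class and method names are hypothetical and not part of Tika or Aspose.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns pipe-terminated two-column rows into key/value pairs.
class PipeTableText {
    static Map<String, String> toKeyValues(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : text.split("\\R")) { // \R matches any line break
            if (!line.contains("|")) {
                continue; // not a table row under our assumption
            }
            String[] cells = line.split("\\|"); // trailing empty cells are dropped by split
            if (cells.length < 2) {
                continue; // need at least a key cell and a value cell
            }
            String key = cells[0].trim();
            String value = cells[1].trim();
            if (!key.isEmpty()) {
                result.put(key, value);
            }
        }
        return result;
    }
}
```

Calling PipeTableText.toKeyValues(sPdfData) would then give a map from first-column text to second-column text. Rows whose cells wrap over several lines, or cell text that itself contains a pipe, would need extra handling.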
I found a very helpful blog article that parses tables using a ContentHandlerDecorator (in Groovy, but similar enough):
https://opensource.com/article/17/8/tika-groovy
I adapted it to just parse all <td> parts into a tab-separated line and to collect the rows in a List by following the <tr> tags, because I needed the table rows to stay intact but needed no special logic inside the table cells.
You can pass your decorator to the BodyContentHandler, which wraps it as a delegate, like so:
new AutoDetectParser().parse(inputStream,
new BodyContentHandler(new MyContentHandlerDecorator()),
new Metadata());
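For illustration, the row-collecting logic described above might look roughly like this. It is a sketch, not the code from the article or from my project: the class name TableRowCollector and the parse helper are hypothetical, and it extends the JDK's DefaultHandler so that it runs without Tika on the classpath. With Tika, the same three overrides could go into a subclass of ContentHandlerDecorator instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one tab-separated string per <tr>, one field per <td>.
class TableRowCollector extends DefaultHandler {
    private final List<String> rows = new ArrayList<>();
    private final List<String> cells = new ArrayList<>();
    private final StringBuilder cell = new StringBuilder();
    private boolean inCell;

    // Matches an element name with or without namespace-aware local names.
    private static boolean is(String name, String localName, String qName) {
        return name.equals(localName) || name.equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (is("tr", localName, qName)) {
            cells.clear(); // a new row begins
        } else if (is("td", localName, qName)) {
            cell.setLength(0); // a new cell begins
            inCell = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCell) {
            cell.append(ch, start, length); // text outside <td> is ignored
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (is("td", localName, qName)) {
            cells.add(cell.toString().trim());
            inCell = false;
        } else if (is("tr", localName, qName)) {
            rows.add(String.join("\t", cells)); // keep the row intact as one line
        }
    }

    List<String> getRows() {
        return rows;
    }

    // Convenience for trying the collector on an XHTML string without Tika.
    static List<String> parse(String xhtml) {
        try {
            TableRowCollector collector = new TableRowCollector();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), collector);
            return collector.getRows();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

Because BodyContentHandler accepts any ContentHandler as its delegate, an instance of this class can be kept in a variable, passed in place of MyContentHandlerDecorator, and queried with getRows() after parsing. Nested tables and <th> header cells are not handled in this sketch.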