Is it possible to extract table information using Apache Tika?
B

4

11

I am looking for a parser for PDF and MS Office document formats that can extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect two columns in key-value format. I checked most of what is available on the net for a solution but could not find one. Any pointers?

Brandiebrandise answered 22/11, 2012 at 16:48 Comment(0)
B
8

Well, I went ahead and implemented it separately using Apache POI for the MS formats, and came back to Tika for PDF. Tika outputs the document content as "SAX based XHTML events".

So basically we can write a custom SAX handler to process the output.

The structured text output has the following form (metadata omitted):

<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>

In our SAX implementation we can treat the first part of each paragraph as the key. For my problem I already know the keys and am looking for the values, so the value is simply the substring that follows the key.

Override public void characters(char[] ch, int start, int length) to implement this logic.

Please note that in my case the structure of the content is fixed and I know which keys to expect, so it was easy to do it this way. This is not a generic solution.
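A minimal sketch of such a handler follows. It is not the original implementation: the class name KeyValueHandler, the extract helper, and the rule that a key is separated from its value by a single space are all assumptions made for illustration. It uses only the JDK's SAX classes and parses the sample XHTML directly so that it runs without Tika, but the handler is an ordinary ContentHandler. Because a SAX parser may deliver one text node in several characters() calls, the text is buffered and the key is matched only when the closing </p> arrives.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "Key Value" paragraphs for a fixed set of known keys.
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder paragraph = new StringBuilder();

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    Map<String, String> getValues() {
        return values;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start a fresh buffer for every paragraph.
        if ("p".equals(localName)) {
            paragraph.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so only accumulate here.
        paragraph.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!"p".equals(localName)) {
            return;
        }
        String line = paragraph.toString().trim();
        for (String key : knownKeys) {
            // Assumption: the key is followed by a space, then the value.
            if (line.startsWith(key + " ")) {
                values.put(key, line.substring(key.length()).trim());
                break;
            }
        }
    }

    // Convenience helper for trying the handler on an XHTML string without Tika.
    static Map<String, String> extract(String xhtml, List<String> keys) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            KeyValueHandler handler = new KeyValueHandler(keys);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getValues();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

With Tika, an instance of this handler would be passed as the ContentHandler argument of the parser's parse method instead of going through the extract helper.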

Brandiebrandise answered 26/11, 2012 at 13:53 Comment(2)
Hey Rajesh, a year later I am facing the same problem as you :) I would like to know if there is any generic solution to this problem. In my case the PDF files can contain any type of table structure, and I have to make sure that tables are extracted properly and, if possible, annotate table captions. Is it possible to do this using Tika? Or is there any other API which can do this? Cholecalciferol
@Cholecalciferol I didn't find any generic solution. Basically you should be able to do this for the MS formats, but I doubt it is possible for PDF (refer: https://mcmap.net/q/1016146/-working-on-tables-in-pdf-using-python-duplicate ). The same thread has a Python solution which might work if you know the table caption. I have not tried it myself. Brandiebrandise
S
8

Tika doesn't parse table information. The confusing part is that it converts table tags to <p>, which means the table structure is lost. This is the case up to the current version, 1.14. It may be remedied in the future, but so far there are no plans to work in that direction.

You can refer to the JIRA issue which discusses this shortcoming in Tika. After the JIRA issue was raised, the wiki was also updated to reflect this limitation. [Disclaimer: I raised the JIRA issue]

Now the solution part: in my experience, Aspose.Pdf for Java does a brilliant job of converting PDF into HTML, but it is commercially licensed. You can check the quality via the free trial version. Code and example links.

Streptomycin answered 1/2, 2017 at 13:36 Comment(1)
Tabula (tabula.technology) is a free, MIT licensed option for extracting tables from PDFs. If you'd like us to integrate that with Tika, please open an issue on our JIRA. Electra
M
2

I use a combination of Tika (tika-app-1.19.jar) and Aspose (aspose-pdf-18.9.1.jar).

I first modify the PDF using Aspose so that there is a pipe ('|') at the end of each table cell, and then read it into Tika and convert it to text.

InputStream is = part.getInputStream(); // input-stream of PDF or PDF part

// Aspose add pipes ("|")
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Document pdfDocument   = new Document(is);   // load existing PDF file

PageCollection pageCollection = pdfDocument.getPages();
int iNumPages = pageCollection.size();

for(int i = 1; i <= iNumPages; i++)
{
    Page page = pageCollection.get_Item(i);
    TableAbsorber absorber = new TableAbsorber(); // Create a TableAbsorber object to find tables
    absorber.visit(page); // Visit the current page with the absorber

    IGenericList<AbsorbedTable> listTables = absorber.getTableList();

    for(AbsorbedTable absorbedTable : listTables)
    {
        IGenericList<AbsorbedRow> listRows = absorbedTable.getRowList();

        for(AbsorbedRow absorbedRow : listRows)
        {
            IGenericList<AbsorbedCell> listCells = absorbedRow.getCellList();

            for(AbsorbedCell absorbedCell : listCells)
            {
                TextFragmentCollection  collectionTextFrag = absorbedCell.getTextFragments();

                Rectangle rectangle = absorbedCell.getRectangle();

                // Add a pipe ("|") at the upper-right corner of the cell to mark where the cell ends
                TextBuilder  textBuilder  = new TextBuilder(page);
                TextFragment textFragment = new TextFragment("|");
                double x = rectangle.getURX();
                double y = rectangle.getURY();
                textFragment.setPosition(new Position(x, y));
                textBuilder.appendText(textFragment);
            }
        }
    }
}
pdfDocument.save(outputStream);
is = new ByteArrayInputStream(outputStream.toByteArray()); // input stream of the modified PDF with pipes ("|") included

Now the PDF input stream above, with pipes ("|") at the table cell ends, can be fed into Tika and converted to text:

BodyContentHandler handler   = new BodyContentHandler();
Metadata           metadata  = new Metadata();
ParseContext       context   = new ParseContext();
PDFParser          pdfParser = new PDFParser();

PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);

//InputStream stream = new ByteArrayInputStream(sIS.getBytes(StandardCharsets.UTF_8));
pdfParser.parse(is, handler, metadata, context);
String sPdfData = handler.toString();
Midvictorian answered 29/10, 2018 at 5:25 Comment(0)
C
0

I found a very helpful blog article that parses tables using a ContentHandlerDecorator (in Groovy, but similar enough): https://opensource.com/article/17/8/tika-groovy

I adapted it to simply join all <td> parts of a row into a tab-separated line and to collect the rows in a List by following the <tr> tags, because I needed the table rows to stay intact but no special logic inside the table cells.

You can pass your decorator to the BodyContentHandler, which wraps it as a delegate, like so:

new AutoDetectParser().parse(inputStream,
    new BodyContentHandler(new MyContentHandlerDecorator()),
    new Metadata());
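The row-collecting logic described above can be sketched roughly as follows. This is not the adapted decorator itself: the class name TableRowCollector and the extractRows helper are hypothetical, and the sketch extends the JDK's DefaultHandler rather than Tika's ContentHandlerDecorator so that it runs without Tika on the classpath. It also treats <th> like <td>, which is an addition of this sketch, and it does not handle nested tables.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: turns each <tr> into one tab-separated line.
class TableRowCollector extends DefaultHandler {
    private final List<String> rows = new ArrayList<>();
    private final List<String> cells = new ArrayList<>();
    private final StringBuilder cell = new StringBuilder();
    private boolean inCell;

    List<String> getRows() {
        return rows;
    }

    private static boolean isCell(String name) {
        return "td".equals(name) || "th".equals(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("tr".equals(localName)) {
            cells.clear();          // new row: forget the previous row's cells
        } else if (isCell(localName)) {
            cell.setLength(0);      // new cell: start with an empty buffer
            inCell = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text inside a cell is kept; everything else is ignored.
        if (inCell) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isCell(localName)) {
            cells.add(cell.toString().trim());
            inCell = false;
        } else if ("tr".equals(localName)) {
            rows.add(String.join("\t", cells));
        }
    }

    // Convenience helper for trying the handler on an XHTML string without Tika.
    static List<String> extractRows(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            TableRowCollector collector = new TableRowCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getRows();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

Since BodyContentHandler accepts any ContentHandler in its constructor, a handler like this should be usable in place of MyContentHandlerDecorator in the call above, with getRows() read after parsing. It only helps for formats where Tika actually emits <tr> and <td> events.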
Conveyancer answered 19/2, 2019 at 14:24 Comment(0)
