How to extract the contents of a table in pdf file? [duplicate]

I want to extract the contents of a table in a PDF like this:

enter image description here

I wrote this Java program using the iText Java PDF library. It can read the contents of a PDF file line by line, but I do not know how to get the contents of the table:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import java.io.IOException;

public class PDFReader {

    public static void main(String[] args) {
        System.out.println("Lecteur PDF");
        System.out.println(readPdf("D:/test.pdf"));
    }

    // Returns the text of every page, concatenated in page order.
    private static String readPdf(String pdfPath) {
        StringBuilder text = new StringBuilder();
        try {
            PdfReader reader = new PdfReader(pdfPath);
            int pageCount = reader.getNumberOfPages();
            // iText page numbers are 1-based, so the loop must include pageCount
            // (the original "i < n" skipped the last page).
            for (int page = 1; page <= pageCount; page++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, page));
            }
            reader.close();
        } catch (IOException err) {
            err.printStackTrace();
        }
        return text.toString();
    }
}

This is what I get:

enter image description here

But that's not what I want. I want to extract the contents of the table line by line and column by column, for example by saving each line in a Java array.

The first array will contain: "N°", "DATE OBSERVATIONS", "TEXTE"

The second array will contain: "029/14", "Le 1er sept 2014 remplace AVURNAV...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav..."

The third array will contain: "037/14", "Le 15 octobre 2014 remplace AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."

And so on.

Thanks

Philosophize answered 9/7, 2015 at 22:0 Comment(5)
Repeat after me: "there is no table. All Tableness you may think exists in this PDF is a mere illusion." From the order of the texts you extracted, you can see it works sort of top to bottom, left to right. You need exact coordinates for each text, and an approximate value for each column and row. Only then can you rebuild it. – Crisscross
@Crisscross Amendment to your mantra: "There is no table. All Tableness you may think exists in this PDF is a mere illusion... unless the PDF is a Tagged PDF." Unfortunately, the OP doesn't provide a link to his PDF so that we can check if it is tagged. So, dear anonymous user: please update your question and tell us whether your PDF is Tagged or not. – Rafael
@BrunoLowagie: Does such a tagged file contain tags for both rows and columns? (I have not (yet) needed this particular workflow.) Then indeed it should be possible. – Crisscross
You can think of a table inside a tagged PDF as if it were an HTML table. There are differences, but you can define rows, cells in rows, column headers, row headers, etc. Tagged PDF is really powerful, but unfortunately most of the PDFs found in the wild aren't tagged. This is going to change though, as more and more governments will require PDFs to be accessible (read: PDF/UA compliant). If I recall correctly, Section 508 on accessibility has recently been amended to include a reference to PDF/UA (or there are plans to do so; sometimes people tell me stuff like this before it happens). – Rafael
@Crisscross We can rebuild it. We have the technology. We can make it better than it was. Better...stronger...faster. – Libertarian

If your PDF library doesn't support extracting tables, you may have to identify the character sequences that commonly begin or end each field and use them to split your data into an array. For instance, the first field is nnn/nn, the second field ends with nnnn/nn, and the third field ends where the next first field begins.
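As a rough sketch of that idea, the following plain Java uses `java.util.regex` to cut already-extracted text into three-field rows. The class name `RecordSplitter`, the exact pattern, and the sample text are assumptions made for illustration. They are not taken from the asker's PDF, whose real field patterns would need to be checked first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecordSplitter {

    // Assumed layout: an id "nnn/nn" at the start of a line, a second field that
    // ends with a reference "nnnn/nn", then free text up to the next id.
    private static final Pattern RECORD = Pattern.compile(
            "^(\\d{3}/\\d{2})\\s+(.*?\\d{4}/\\d{2})\\s+(.*?)(?=^\\d{3}/\\d{2}\\s|\\z)",
            Pattern.MULTILINE | Pattern.DOTALL);

    /** Splits extracted page text into rows of three fields each. */
    static List<String[]> split(String extractedText) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(extractedText);
        while (m.find()) {
            rows.add(new String[] {
                    m.group(1), collapse(m.group(2)), collapse(m.group(3))});
        }
        return rows;
    }

    // Joins the wrapped lines of one cell into a single line.
    private static String collapse(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Invented sample text standing in for the output of the PDF library.
        String sample = "ID DATE TEXT\n"
                + "101/15 1 Jan 2015 replaces notice 0001/15 HARBOUR Dredging starts\n"
                + "on Monday\n"
                + "102/15 2 Feb 2015 replaces notice 0002/15 HARBOUR Buoy moved\n";
        for (String[] row : split(sample)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Lines that do not start with an id, such as the header line, are skipped by this pattern and would have to be handled separately. The approach also breaks as soon as a cell's text happens to contain something that looks like a delimiter.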

This is a tricky problem. I have had to use coordinate-based approaches to deal with it before, but your PDF library may not support extracting the positions of letters along with the actual text.
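A coordinate-based approach could look roughly like the sketch below. It is library-independent: it assumes you have already obtained text chunks with their x and y positions (in iText 5 these would typically come from a `RenderListener` receiving `TextRenderInfo` objects, which is not shown here) and that you know the approximate left edge of each column. The `TableRebuilder` and `Chunk` types are hypothetical helpers, and the rule "a chunk in the first column starts a new row" is an assumption that suits tables whose first column holds a one-line id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TableRebuilder {

    /** A piece of text and the position of its left baseline point (hypothetical type). */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Index of the last column whose left edge is at or before x.
    static int columnOf(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                col = i;
            }
        }
        return col;
    }

    /** Groups chunks into rows; columnStarts must be sorted in ascending order. */
    static List<String[]> rebuild(List<Chunk> chunks, float[] columnStarts) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y grows upward, so a larger y is higher on the page. Rounding y is a
        // crude way to treat chunks on the same text line as equal.
        sorted.sort(Comparator.comparingInt((Chunk c) -> -Math.round(c.y))
                .thenComparingDouble(c -> c.x));

        List<String[]> rows = new ArrayList<>();
        String[] current = null;
        for (Chunk c : sorted) {
            int col = columnOf(c.x, columnStarts);
            // Assumption: every table row begins with a chunk in the first column.
            if (col == 0 || current == null) {
                current = new String[columnStarts.length];
                Arrays.fill(current, "");
                rows.add(current);
            }
            // Later lines of a wrapped cell are appended to the same cell.
            current[col] = current[col].isEmpty() ? c.text : current[col] + " " + c.text;
        }
        return rows;
    }
}
```

Calling `TableRebuilder.rebuild(chunks, new float[] {40f, 110f, 290f})` would return one `String[]` per table row, with wrapped lines merged into their cell. Real pages usually need more tolerance than this, for example for baselines that differ slightly within one line or for cells that start higher than the id in the first column.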

Healall answered 9/7, 2015 at 22:8 Comment(1)
iText does allow you to get the x and y coordinates of all text snippets and even of all glyphs, but it remains a tricky problem, as explained in the answer to the original question. – Rafael
