I have the same data saved as a GIF image file and as a PDF file, and I want to convert it to HTML or XML. The data is the menu for my university's cafeteria, which means that a new version of the file has to be parsed each week. In general, the files contain some header and footer text, with a table of data in between. I have read some posts on Stack Overflow and have already made several attempts to extract the table data as HTML/XML:
PDF
- PDFBox || iText (Java)
- Google Docs Import
- PDF2HTML || PDF2Table
GIF
- Tesseract-OCR
I got the best results from parsing the PDF file with PDFBox, but because the menu changes weekly, it is still not reliable enough. The HTML that I get sometimes contains more and sometimes fewer paragraphs (<p>), so I cannot parse the data precisely enough.
That is why I would like to know whether there is another way to do it.