This PDF (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables.
I'd like to programmatically extract the data and the structure from these tables.
Things I've tried: converting the PDF to HTML using
- Tika: Unfortunately, the tables are converted to space-delimited paragraphs, and some of the strings contain spaces, so it's not possible to split them back into cells reliably.
- Python's PDFMiner: returned an assertion error due to missing fonts. I suspect the HTML would have been similar to the output from Tika, though I'll need to resolve the issue with the missing fonts to confirm this.
- Online tools: I tried http://www.zamzar.com/ and a couple of others. Either the file was too big for the service to process, or the conversion generated errors.
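To illustrate the problem with the Tika output, here is a minimal sketch using a made-up row (the cell values are placeholders, not taken from the PDF). Once the cells are joined with single spaces, splitting on whitespace no longer recovers the original cell boundaries:

```python
# Hypothetical row of three cells; two of them contain spaces.
cells = ["Model A", "Core i5", "4 GB"]

# Roughly what a space-delimited paragraph looks like after conversion.
flattened = " ".join(cells)

# Splitting on whitespace yields one token per word, not one per cell.
tokens = flattened.split()

print(flattened)
print(len(cells), "cells became", len(tokens), "tokens")
```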
I was planning to convert the PDF to HTML and then parse it with BeautifulSoup.
The output could be JSON (e.g. one object per table), XML, or pretty much any format that maintains the structure.
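As a rough sketch of the kind of output I have in mind, assuming the cells have already been extracted somehow: the helper `table_to_object`, the field names `title`, `columns` and `rows`, and the sample values are all placeholders I made up for illustration, not data from the PDF.

```python
import json


def table_to_object(title, headers, rows):
    """Turn one table into a dict that keeps its header/row structure."""
    return {
        "title": title,
        "columns": list(headers),
        # Each row becomes a mapping from column name to cell text.
        "rows": [dict(zip(headers, row)) for row in rows],
    }


# Hypothetical placeholder data standing in for one extracted table.
tables = [
    table_to_object(
        "Example table",
        ["Model", "Processor", "Memory"],
        [
            ["Model A", "Processor X", "4 GB"],
            ["Model B", "Processor Y", "8 GB"],
        ],
    )
]

# One JSON object per table.
print(json.dumps(tables, indent=2))
```

Any format would do as long as it preserves which cell belongs to which row and column.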