Converting .pdf files to Excel (.xls)

Asked 12/12, 2012 at 16:12 Answered 12/6, 2022 at 12:56

A friend of mine who is doing an internship asked me 2 hours ago if I could help him avoid manually converting 462 PDF files to .xls using free online software.

I thought of a shell script using unoconv, but I couldn't work out how to use it properly, and I am not sure unoconv can solve this problem, since it mainly converts files to PDF, not the reverse.

Organ answered 12/12, 2012 at 16:12 Comment(1)

What do you expect the conversion to do? For myself, I have no idea what converting a graphical page description document to a spreadsheet document could mean. – Mis 12/12, 2012 at 17:40

Conversion from PDF to any other structured format is not always possible and not generally recommended.

Having said that, this does look like a one-off job and there's a fair few of them (462).

It's worth pursuing if you can reliably extract text from most of them and it's reasonably structured. It's a matter of trying to get regular text output across a sample of the PDFs that you can reliably parse into a table structure.

There are plenty of tools around that target either direct or OCR-based text extraction; just search around.

One I like is pstotext from the Ghostscript suite; the -bboxes option lets me get the coordinates of each word and leaves it up to me to re-assemble the structure. Despite its name, it does work on PDF input. The downside is that it can be a bit flaky: it works on some PDFs but not others.
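To make "re-assemble the structure" a bit more concrete, here is a minimal Python sketch of one way to turn positioned words into table rows. It assumes you have already parsed the extractor's output into (x, y, text) tuples; that tuple format, the function names and the y_tolerance value are my own assumptions for illustration, not pstotext's actual output format.

```python
from typing import List, Tuple

# (x, y, text): a hypothetical, already-parsed word position.
# This is NOT the raw output format of pstotext or any other tool.
Word = Tuple[float, float, str]


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[List[Word]]:
    """Group words into rows when their y is within y_tolerance of the row's first word."""
    rows: List[List[Word]] = []
    # Sort top to bottom, assuming y grows upwards as it usually does in PDF coordinates.
    for word in sorted(words, key=lambda w: -w[1]):
        if rows and abs(rows[-1][0][1] - word[1]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order the words left to right.
    return [sorted(row, key=lambda w: w[0]) for row in rows]


def rows_to_cells(rows: List[List[Word]]) -> List[List[str]]:
    """Drop the coordinates and keep only the text of each word."""
    return [[text for _, _, text in row] for row in rows]


if __name__ == "__main__":
    sample = [
        (10.0, 100.0, "Name"), (80.0, 100.5, "Qty"),
        (10.0, 80.0, "Bolt"), (80.0, 79.8, "12"),
    ]
    print(rows_to_cells(group_into_rows(sample)))
```

This treats every word as its own cell. Real tables usually need an extra step that merges neighbouring words into one cell based on the horizontal gap between them, and the right tolerance depends on the font size used in the PDFs.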

If you get this far, you'd then most likely need to write a shell script or program to convert that to a CSV. You can either open this directly in a spreadsheet or look for tools to convert it to XLS.
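As a rough sketch of that last step, the Python below writes parsed rows out as CSV for a whole folder of extracted text files. The folder layout (one .txt per PDF) and the parse callback are assumptions of mine; the parsing itself depends entirely on what your PDFs look like.

```python
import csv
import io
from pathlib import Path
from typing import Callable, Iterable, List

Row = List[str]


def rows_to_csv_text(rows: Iterable[Row]) -> str:
    """Serialise rows as CSV text; the csv module handles quoting of commas and quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def convert_folder(src_dir: Path, parse: Callable[[str], List[Row]]) -> int:
    """Write a .csv next to every .txt file in src_dir and return how many were written.

    `parse` is a hypothetical, user-supplied function that turns the extracted
    text of one PDF into a list of rows.
    """
    count = 0
    for txt_path in sorted(src_dir.glob("*.txt")):
        rows = parse(txt_path.read_text(encoding="utf-8"))
        txt_path.with_suffix(".csv").write_text(rows_to_csv_text(rows), encoding="utf-8")
        count += 1
    return count


if __name__ == "__main__":
    print(rows_to_csv_text([["Name", "Qty"], ["Bolt, M6", "12"]]))
```

Using the csv module rather than joining fields with commas by hand matters here, because cells taken from PDFs often contain commas or quotes themselves.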

PS: If he hasn't already, get the intern to ask if there's any possible way of getting at the original data that was used to create the PDFs. It will save a lot of time and effort and lead to a far more accurate result.

Update: An alternative to pstotext is the renderpdf.pl command, which is included in the Perl CAM::PDF module. It is more robust, but it just reports the text (x,y) position, not bounding boxes.

Brigitta answered 13/12, 2012 at 1:18 Comment(3)

Thanks for this response, I will try pstotext as soon as I get home. He is not an IT person, so your "PS" may be very useful; it's worth a shot to ask him! Thanks for everything – Organ 13/12, 2012 at 8:58

OK, thanks, but I have already solved the problem. If this can help someone, just upvote his response! – Organ 21/12, 2012 at 14:50

Other answers included Okular and its table selection tool (ctrl+5), which lets you manually fix up cell borders. – Thurible 28/10, 2021 at 7:44

Other responses on a linked question suggest Tabula, too.

https://github.com/tabulapdf/tabula

I tried it and it works very well.

Busra answered 12/6, 2022 at 12:56 Comment(0)
