pdf-scraping Questions

6

Solved

Is that even possible!?! I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave t...
Groscr asked 7/2, 2012 at 23:46

10

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", lin...
Iceni asked 28/1, 2015 at 13:2

4

I'm trying to extract data from tables inside some pdf reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to ext...
Adytum asked 23/5, 2017 at 17:15

6

I have thousands of pdf files that I need to extract data from. This is an example pdf. I want to extract this information from the example pdf. I am open to nodejs, python or any other effective...
Chancellor asked 14/9, 2019 at 21:42

3

I am trying to get data from PDFs available on the site https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en. For example, if I look at the November 2019 report https://downloads...
Horizon asked 1/12, 2019 at 22:43

2

Solved

How to scrape tables preceded by some title text from a PDF? I am experimenting with the tabulizer package. Here is an example of getting a table from a specific page (Polish "Map of Public Health Needs"...
Helman asked 28/1, 2019 at 14:8

4

I am working on a pdf file. There are a number of tables in that pdf. According to the table names given in the pdf, I want to fetch the data from those tables using python. I have worked on ht...
Citrin asked 20/3, 2012 at 7:42

0

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: <word xMin="351.852025" yMin="42.548936" xMax="365.689478" yMax="47.681498">foo&l...
Elegit asked 6/5, 2018 at 11:23

3

Solved

Are there any open source libraries that support table identification & extraction? By this I mean: Identify a table structure exists Classify the table from its contents Extract data...
Pericline asked 16/2, 2015 at 0:4

2

I am using python 3.5 and I want to read the text line by line from pdf files. I was trying to use pdfminer3k but could not find the proper syntax anywhere. How do I use it correctly?
Linguini asked 17/5, 2017 at 12:20

7

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not...
Mt asked 6/8, 2013 at 10:58

1

Solved

I'm searching for an API or a program (preferably Python and open-source) which lets me download the first n pictures of a Google Image Search for let's say bicycles. It would also be helpful...
Oneness asked 7/4, 2016 at 12:3

1

Solved

I have written python code that scrapes all the data from the PDF file. The problem here is that once it is scraped, the words lose their grammar. How to fix this problem? I am attaching the code. ...
Patsy asked 14/3, 2016 at 18:50

1

I am developing a C# winform application that converts the pdf contents to text. All the required contents are extracted except the content found in highlighted text of the pdf. Please help to get ...
Bypath asked 28/4, 2014 at 13:31

13

Solved

Is there any python module to convert PDF files into text? I tried one piece of code found in Activestate which uses pypdf but the text generated had no space between and was of no use.
Brotherson asked 25/8, 2008 at 4:44

1

Solved

I have tried the example code recommended in the tm::readPDF documentation: library(tm) if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) { uri <- system.file(file.path("doc", "tm.pd...
Cortie asked 6/8, 2013 at 12:37

3

I have a requirement to split a large pdf document into smaller files based on the content of the file. We use BCL easyPDF to manipulate pdf files. easyPDF can split pdf documents based on a ...
Sweetie asked 3/5, 2012 at 18:19

1

I have PDF files with text that should be replaced. More specifically, the text should be translated and replaced with the translated version. It's important that the rest of the PDF structure stays ...
Openmouthed asked 5/7, 2011 at 23:50
