pdf-extraction Questions

2

I am trying to see if I can identify possible table headers in a table inside a PDF using the background and foreground color of the text. With PyMuPDF text extraction, I was able to get the foreground col...
International asked 26/9, 2019 at 6:30

13

Solved

I have a large number of files; some of them are images scanned into PDF and some are full or partial text PDFs. Is there a way to check these files to ensure that we are only processing files which ar...
Decoupage asked 16/4, 2019 at 8:54

1

I am trying to extract text from scanned PDFs using PyPDF2. Some of the PDFs contain text aligned vertically, but the orientation of the page is portrait. Is there any way to identify if the tex...
Hube asked 27/9, 2018 at 5:53

2

I use Pdfplumber to extract the table on page 2, section 3 (normally). But it only works on some PDFs; on others it does not. For the failed PDF files, it seems like Pdfplumber reads the button table ...
Ovule asked 20/7, 2020 at 17:1

4

Solved

In order to get a single string from a multi-paged PDF I'm doing this: import PyPDF2 pdfFileObject = open('sample.pdf', 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObject) count = pdfReader.numP...
Grimes asked 13/2, 2020 at 1:3

3

Solved

I am trying to extract Hindi text from a PDF. I tried all the methods to extract it from the PDF, but none of them worked. There are explanations of why it doesn't work, but no answers as such. So, I deci...
Charlet asked 3/6, 2021 at 6:6

10

Solved

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file to easily r...
Netti asked 19/4, 2019 at 20:29

3

I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with the headings Introduction, Summary, Contents. I need to extract only the text under the heading 'Su...
Albanese asked 5/1, 2018 at 5:19

1

I am trying to extract images from a PDF along with the BBox coordinates of each image. I tried using the pdfrw library; it identifies image objects and it has an attribute called media box which has some coor...
Abidjan asked 6/2, 2019 at 6:41

2

I am using Camelot to read complete PDFs and extract about 112 attributes from each one. I use table areas to extract the attributes test_variable = camelot.read_pdf(filename, flavor='stream', ...
Cline asked 14/1, 2019 at 6:32

1

Solved

I'm writing a web app that extracts a line at the top of each page in a PDF. The PDFs come from different versions of a product and could go through a number of PDF printers, also in different vers...
Collaboration asked 29/10, 2015 at 11:59

1

I want to extract the contents of a table in a PDF like this: I wrote this Java program using the iText Java PDF library, which can read the contents of a PDF file line by line, but I do n...
Philosophize asked 9/7, 2015 at 22:0

1

I would like to know how to crawl data inside a PDF file using Scrapy. Which module should I use, and which is the best and most effective way? Could you please give me some sample tutorials on this Th...
Cynarra asked 8/7, 2015 at 9:10

0

I'm trying to extract text from a particular section of a PDF. If I know the X,Y co-ordinates of the area, I'm able to extract the text. But I'm unable to get the co-ordinates of the selected area ...
Deccan asked 25/6, 2014 at 4:14

2

Solved

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do th...
Sericeous asked 27/3, 2014 at 0:8

5

Solved

I have a PDF file including form fields and need to export the data into an XML file AUTOMATICALLY. Here is a screen of a sample form I created for testing: Note: It works great exporting it MANU...
Winthorpe asked 9/1, 2014 at 0:40