text-extraction Questions

7

Solved

I know we can use PHP DOM to parse HTML in PHP, but I have a specific requirement. I have HTML content like the following: <p class="Heading1-P"> <span class="Heading1-H"...
Indignity asked 21/8, 2013 at 4:55

3

I have a general question regarding extracting text, precisely tabular data, from PDF files. How are PDF viewers able to read and display a table? And why can't we just get the necessary column inf...
Esposito asked 22/12, 2012 at 10:19

5

Is there an (unobtrusive, to the user) way to get all the text in a page with Javascript? I could get the HTML, parse it, remove all tags, etc, but I'm wondering if there's a way to get the text fr...
Hagai asked 7/6, 2010 at 3:57

8

Solved

I have an array, and in that array I have a key that looks like show_me_160. This key may change a little, so sometimes when the page loads the key may be show_me_120. I want to ...
Decury asked 14/10, 2010 at 9:59

6

Solved

I have several strings of the format AA11 AAAAAA1111111 AA1111111 I need to separate the alphabetic and numeric components of the string.
Monopteros asked 13/7, 2012 at 19:47
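
One possible approach to the question above, as a minimal sketch: the excerpt does not name a language, so Python is assumed here, and the helper name `split_alpha_numeric` is hypothetical. It also assumes every string is a run of letters followed by a run of digits, as in the three samples given.

```python
import re

def split_alpha_numeric(s):
    """Split a string such as 'AA11' into its leading letters and trailing digits."""
    # Assumption: the whole string is letters followed by digits, nothing else.
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", s)
    if match is None:
        raise ValueError(f"unexpected format: {s!r}")
    return match.group(1), match.group(2)

# The three sample strings from the question.
parts = [split_alpha_numeric(s) for s in ["AA11", "AAAAAA1111111", "AA1111111"]]
print(parts)
```

Strings that interleave letters and digits would need a different pattern, for example `re.findall(r"[A-Za-z]+|\d+", s)`.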

8

I can already use Textract, but only with JPEG files. I would like to use it with PDF files. I have the code below: import boto3 # Document documentName = "Path to document in JPEG" # Read doc...
Jaclynjaco asked 25/11, 2019 at 18:46

3

Solved

I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so...
Waltz asked 9/4, 2019 at 0:26

2

Solved

I am working on a task to extract some information (in Hindi) from a PDF file and convert it into a data frame. I have tried many things and followed many articles and answers on Stack Overflow as...
Loveinidleness asked 31/3, 2023 at 7:58

10

Solved

My question is sort of like this question, but I have more constraints: I know the documents are reasonably sane; they are very regular (they all came from the same source); I want about 99% of the ...
Ancel asked 21/1, 2010 at 23:3

4

Is there a possibility to extract plain text from a PDF-File with PdfSharp? I don't want to use iTextSharp because of its license.
Recall asked 13/4, 2012 at 12:48

2

I want to use textract (via aws cli) to extract tables from a pdf file (located in an s3 location) and export it into a csv file. I have tried writing a .py script but am struggling to read from th...

6

Solved

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API and all the relevant examples I have found co...
Theotheobald asked 21/10, 2014 at 18:56

7

Solved

I need to isolate the latest occurring integer in a string containing multiple integers. How can I get 23 instead of 1 for $lastnum1? $text = "1 out of 23"; $lastnum1 = $this->getEval(...
Arabele asked 25/9, 2012 at 19:8
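
The question above is written in PHP; as a hedged illustration of the idea only, the same "take the last run of digits" logic can be sketched in Python. The function name `last_integer` is hypothetical and not taken from the question.

```python
import re

def last_integer(text):
    """Return the last run of digits in text as an int, or None if there is none."""
    numbers = re.findall(r"\d+", text)  # every maximal run of digits, in order
    return int(numbers[-1]) if numbers else None

print(last_integer("1 out of 23"))
```

Collecting all matches and taking the final one avoids having to anchor the pattern at the end of the string.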

6

Solved

I am looking for algorithms that allow text extraction from websites. I do not mean "strip html", or any of the hundreds of libraries that allow this. So for example for a news article I would lik...
Sandpit asked 21/4, 2011 at 15:2

8

Solved

From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable? I would like to capture all of the text from these elements and sto...
Hatpin asked 14/1, 2010 at 14:31
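
A minimal sketch of one way to collect heading text, assuming Python and its standard-library `html.parser` module (the excerpt does not say which language the asker uses). The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collects the text content of <h1> to <h6> elements."""
    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # finished heading strings
        self._current = None  # text pieces of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = []

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._current is not None:
            self.headings.append("".join(self._current).strip())
            self._current = None

    def handle_data(self, data):
        # Only keep text that appears inside an open heading.
        if self._current is not None:
            self._current.append(data)

def extract_headings(html):
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings

print(extract_headings("<h1>Title</h1><p>body</p><h2>Sub <b>bold</b></h2>"))
```

Using a parser rather than a regular expression keeps text from nested inline tags, such as the `<b>` in the sample, inside the heading it belongs to.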

6

Solved

I would like to convert a string of delimited dimension values into floating numbers. For example 152.15 x 12.34 x 11mm into 152.15, 12.34 and 11 and store in an array such that: $dim[0] = 152.15...
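
The question above uses PHP variables; the following is only a Python sketch of the same idea, with a hypothetical function name. It assumes the values are non-negative decimals and that any unit suffix such as `mm` can be ignored.

```python
import re

def parse_dimensions(text):
    """Return every decimal number found in text as a float, in order of appearance."""
    # Assumption: values are non-negative and use '.' as the decimal separator.
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]

dim = parse_dimensions("152.15 x 12.34 x 11mm")
print(dim)
```

Matching the numbers directly, rather than splitting on `x`, means the trailing unit never has to be stripped separately.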

5

Solved

I have an MS docx file and I need to extract text from it page by page. I tried python-docx, but it could only extract the whole text, not page by page. I have also converted my docx to pdf and th...
Reitareiter asked 18/12, 2019 at 4:53

1

Solved

Attempted Solution at bottom of post. I have near-working code that extracts the sentence containing a phrase, across multiple lines. However, some pages have columns. So respective outputs are inc...

4

Solved

I have a PDF which contains tables, text and some images. I want to extract the tables wherever they appear in the PDF. Right now I find the tables on each page manually. From there I...
Unsatisfactory asked 28/11, 2017 at 14:23

7

I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDF...
Whatley asked 5/1, 2016 at 7:33

5

sudo python3 -m pip install textract sudo apt-get install textract pip install textract sudo apt-get install swig I want to install textract for Python 3, but it does not install properly; it gives ...
Twickenham asked 25/11, 2017 at 6:30

4

I want to grab an img tag from text returned from JSON data like that. I want to grab this from a string: <img class="img" src="https://fbcdn-photos-c-a.akamaihd.net/hphotos-ak-frc3/1239478_598...
Warton asked 6/9, 2013 at 19:15

23

Solved

I want to extract the digits from a string that contains numbers and letters like: "In My Cart : 11 items" I want to extract the number 11.
Herminahermine asked 8/6, 2011 at 11:53
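
A minimal sketch of one approach, assuming Python (the excerpt does not name a language) and assuming the first run of digits in the string is the one wanted. The function name `first_integer` is hypothetical.

```python
import re

def first_integer(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

print(first_integer("In My Cart : 11 items"))
```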

1

I am working on topic modeling tasks using Python and would like to extract text from annual/sustainability reports. However, when I tried to extract the report, the extracted line...

8

Solved

I have the text below: [email protected], "assdsdf" <[email protected]>, "rodnsdfald ferdfnson" <[email protected]>, "Affdmdol Gondfgale" <[email protec...
Buchalter asked 21/1, 2013 at 14:11
