text-extraction Questions
7
Solved
I know we can use PHP DOM to parse HTML in PHP, but I have a specific requirement. I have HTML content like the below:
<p class="Heading1-P">
<span class="Heading1-H"...
Indignity asked 21/8, 2013 at 4:55
3
I have a general question regarding extracting text, precisely tabular data, from PDF files.
How are PDF viewers able to read and display a table? And why can't we just get the necessary column inf...
Esposito asked 22/12, 2012 at 10:19
5
Is there an (unobtrusive, to the user) way to get all the text in a page with JavaScript? I could get the HTML, parse it, remove all tags, etc., but I'm wondering if there's a way to get the text fr...
Hagai asked 7/6, 2010 at 3:57
8
Solved
I have an array containing a key that looks like show_me_160. The key may change a little, so sometimes the page may load and the key may be show_me_120. I want to ...
Decury asked 14/10, 2010 at 9:59
6
Solved
I have several strings of the format
AA11
AAAAAA1111111
AA1111111
I need to separate the alphabetic and numeric components of the string.
Monopteros asked 13/7, 2012 at 19:47
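A minimal sketch of one way to approach the question above, assuming every string is a run of ASCII letters followed by a run of digits, as in the three samples. The asker's language is not stated in the excerpt, so Python is used here, and the function name `split_alpha_numeric` is hypothetical.

```python
import re

def split_alpha_numeric(s):
    """Return (letters, digits) for strings shaped like 'AA11', else None."""
    # fullmatch requires the whole string to be letters followed by digits.
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", s)
    return m.groups() if m else None

for sample in ["AA11", "AAAAAA1111111", "AA1111111"]:
    print(sample, split_alpha_numeric(sample))
```

Strings that do not fit the letters-then-digits shape return `None` rather than a partial split, which makes unexpected input easy to spot.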
8
I can already use Textract, but only with JPEG files. I would like to use it with PDF files.
I have the code below:
import boto3
# Document
documentName = "Path to document in JPEG"
# Read doc...
Jaclynjaco asked 25/11, 2019 at 18:46
3
Solved
I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so...
Waltz asked 9/4, 2019 at 0:26
2
Solved
I am working on a task to extract some information (in Hindi) from a PDF file and convert it into a data frame.
I have tried many things and followed many articles and answers on Stack Overflow as...
Loveinidleness asked 31/3, 2023 at 7:58
10
Solved
My question is sort of like this question but I have more constraints:
I know the documents are reasonably sane
they are very regular (they all came from the same source)
I want about 99% of the ...
Ancel asked 21/1, 2010 at 23:3
4
Is it possible to extract plain text from a PDF file with PdfSharp?
I don't want to use iTextSharp because of its license.
Recall asked 13/4, 2012 at 12:48
2
I want to use Textract (via the AWS CLI) to extract tables from a PDF file (located in an S3 location) and export them to a CSV file. I have tried writing a .py script but am struggling to read from th...
Pocked asked 13/10, 2020 at 17:18
6
Solved
I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.
It looks like PDFMiner updated their API and all the relevant examples I have found co...
Theotheobald asked 21/10, 2014 at 18:56
7
Solved
I need to isolate the latest occurring integer in a string containing multiple integers.
How can I get 23 instead of 1 for $lastnum1?
$text = "1 out of 23";
$lastnum1 = $this->getEval(...
Arabele asked 25/9, 2012 at 19:8
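The question above is asked in PHP; as a rough sketch of the same idea in Python, one option is to collect every run of digits and keep the last one. The function name `last_integer` is hypothetical, and the sketch assumes unsigned integers only.

```python
import re

def last_integer(text):
    """Return the last run of digits in text as an int, or None if there is none."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None

print(last_integer("1 out of 23"))
```

Taking the last element of all matches avoids having to reason about greedy versus lazy quantifiers in a single pattern.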
6
Solved
I am looking for algorithms that allow text extraction from websites. I do not mean "strip html", or any of the hundreds of libraries that allow this.
So for example for a news article I would lik...
Sandpit asked 21/4, 2011 at 15:2
8
Solved
From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?
I would like to capture all of the text from these elements and sto...
Hatpin asked 14/1, 2010 at 14:31
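One possible approach to the question above, sketched in Python with the standard library's `html.parser` rather than in the asker's own language. The names `HeadingCollector` and `extract_headings` are hypothetical, and the sketch assumes headings are `h1` through `h6` elements that are not nested inside each other in any meaningful way.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HeadingCollector(HTMLParser):
    """Collects the text content of every heading element, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []  # finished heading texts
        self._depth = 0     # greater than 0 while inside a heading element
        self._buffer = []   # text pieces of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            if self._depth == 0:
                self._buffer = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.headings.append("".join(self._buffer).strip())

    def handle_data(self, data):
        # Text inside inline children such as <em> is kept as well.
        if self._depth > 0:
            self._buffer.append(data)

def extract_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings

print(extract_headings("<h1>Title</h1><p>body</p><h2>Sub <em>part</em></h2>"))
```

Using a parser instead of a regular expression keeps the text of inline children inside a heading and tolerates attributes on the heading tags.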
6
Solved
I would like to convert a string of delimited dimension values into floating numbers.
For example
152.15 x 12.34 x 11mm
into
152.15, 12.34 and 11
and store in an array such that:
$dim[0] = 152.15...
Pule asked 3/6, 2009 at 12:16
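The question above is asked in PHP; a hedged Python sketch of the same idea is to pull out every decimal number and convert each to a float. The function name `parse_dimensions` is hypothetical, and the pattern assumes plain unsigned decimals with `.` as the decimal separator.

```python
import re

def parse_dimensions(text):
    """Return every decimal number found in text as a list of floats."""
    # Matches '152.15' and '11'; the trailing unit ('mm') is simply ignored.
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]

dim = parse_dimensions("152.15 x 12.34 x 11mm")
print(dim)
```

Matching the numbers directly, instead of splitting on `x`, means the unit suffix and irregular spacing do not need special handling.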
5
Solved
I have a MS docx file and I need to extract text from it page-wise.
I tried python-docx, but it could only extract the whole text, not page by page.
I have also converted my docx to pdf and th...
Reitareiter asked 18/12, 2019 at 4:53
1
Solved
Attempted Solution at bottom of post.
I have near-working code that extracts the sentence containing a phrase, across multiple lines.
However, some pages have columns. So respective outputs are inc...
Stroboscope asked 30/11, 2021 at 13:56
4
Solved
I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF.
Right now I am finding the table on each page manually. From there I...
Unsatisfactory asked 28/11, 2017 at 14:23
7
I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information.
I want to use PDF...
Whatley asked 5/1, 2016 at 7:33
5
sudo python3 -m pip install textract
sudo apt-get install textract
pip install textract
sudo apt-get install swig
I want to install textract for Python 3, but it does not install properly; it gives ...
Twickenham asked 25/11, 2017 at 6:30
4
I want to grab an img tag from text returned from JSON data like that. I want to grab this from a string:
<img class="img" src="https://fbcdn-photos-c-a.akamaihd.net/hphotos-ak-frc3/1239478_598...
Warton asked 6/9, 2013 at 19:15
23
Solved
I want to extract the digits from a string that contains numbers and letters like:
"In My Cart : 11 items"
I want to extract the number 11.
Herminahermine asked 8/6, 2011 at 11:53
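The excerpt above does not say which language the asker is using, so this is only a Python sketch of one simple approach: keep the ASCII digit characters and drop everything else. The function name `digits_only` is hypothetical. Note the assumption that the string contains a single number; if it held several, their digits would be joined together.

```python
import string

def digits_only(text):
    """Return the ASCII digits of text, in order, as one string."""
    return "".join(ch for ch in text if ch in string.digits)

print(digits_only("In My Cart : 11 items"))
```

If the input may contain more than one number, a pattern-based search for separate digit runs would be a safer choice than this character filter.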
1
I am working on topic modeling tasks using python and I would like to extract texts from annual/sustainability reports. However my problem is, when I tried to extract the report, the extracted line...
Drove asked 25/8, 2021 at 8:4
8
Solved
I have the text below:
[email protected], "assdsdf" <[email protected]>, "rodnsdfald ferdfnson" <[email protected]>, "Affdmdol Gondfgale" <[email protec...
Buchalter asked 21/1, 2013 at 14:11