text-extraction Questions
2
Solved
I am trying to extract text from a PDF book and keep running into an issue where sections of copied text fail to retain their proper capitalization when pasted into a text document. I have r...
Castara asked 19/7, 2013 at 3:45
8
I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.
I want to extract the information that is in between the paragraph tags, ...
Undetermined asked 6/9, 2009 at 16:52
1
Solved
This is my image:
I used this link (Tesseract) to capture and process the image:
http://kurup87.blogspot.com/2012/03/android-ocr-tutorial-image-to-text.html
But this is the issue, if this entire...
Pittman asked 1/6, 2013 at 16:50
1
Solved
I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), ho...
Agnomen asked 25/3, 2013 at 19:0
6
Solved
Right now I use Jsoup to periodically extract certain information (not all the text) from some third-party webpages. This works fine until the HTML of a certain webpage changes, this change ...
Vestibule asked 4/3, 2013 at 12:45
11
Solved
I would like to extract all the text (displayed or not) from a general HTML page.
I would like to remove:
any HTML tags
any JavaScript
any CSS styles
Is there a regular expression (one or mor...
Cogon asked 8/10, 2008 at 1:43
2
Solved
I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user Tweets to generate tags (keywords).
And it seems like the...
Sabrasabre asked 4/5, 2010 at 9:20
8
Solved
There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three,...
Intern asked 26/12, 2009 at 1:22
1
Solved
I'm using Tesseract but I don't know whether it neglects any nontext area and targets text only. Do I have to remove any nontext area as a preprocessing step for better output?
Protoplasm asked 17/4, 2012 at 15:5
3
I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .
Imagine some sentences along the...
Bacchanalia asked 8/12, 2011 at 16:26
4
How can I extract hyphenated strings from this string line?
ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service
I just want to extract "ADW-CFS-WE" from it but has been very unsuccess...
Finger asked 2/9, 2011 at 6:52
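For the hyphenated-string question above, here is a minimal sketch in Python. The visible excerpt does not name a language, so Python is an assumption, as is the definition of a "hyphenated string" as two or more alphanumeric groups joined by single hyphens. The names `HYPHENATED` and `extract_hyphenated` are hypothetical.

```python
import re

# Assumption: a "hyphenated string" is two or more alphanumeric groups
# joined by single hyphens, e.g. "ADW-CFS-WE".
HYPHENATED = re.compile(r"\b[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+\b")


def extract_hyphenated(line):
    """Return every hyphen-joined token found in line, in order."""
    return HYPHENATED.findall(line)


line = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service"
print(extract_hyphenated(line))  # ['ADW-CFS-WE']
```

Plain words such as "CI" or "SLANAME" are skipped because the pattern requires at least one hyphen followed by another group.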
1
Solved
I am trying to pull the href from a URL in some data using PHP's DOMDocument.
The following pulls the anchor for the url, but I want the url
$events[$i]['race_1'] = trim($cols->item(1)->nod...
Pyrrhonism asked 12/7, 2011 at 15:53
4
Solved
I would like to extract some data from a piece of text with Vim. The input looks like so:
72" title="(168,72)" onmouseover="posizione('(168,72)');" onmouseout="posizio...
Brimstone asked 3/7, 2011 at 18:44
1
Solved
Is there a way to use readability (text extraction algorithm) and a custom algorithm in Python to extract links from text?
I'd like to figure out a way of extracting links that are in the body of ...
Coparcener asked 3/1, 2011 at 23:20
3
Solved
I am crawling news websites and want to extract the News Title, News Abstract (First Paragraph), etc.
I plugged into the webkit parser code to easily navigate the webpage as a tree. To eliminate navigation...
Fluky asked 8/11, 2009 at 15:42
4
I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word m...
Romance asked 16/3, 2010 at 8:42
6
Solved
Using sed or similar how would you extract lines from a file? If I wanted lines 1, 5, 1010, 20503 from a file, how would I get these 4 lines?
What if I have a fairly large number of lines I need t...
Precessional asked 6/1, 2010 at 23:6
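For the question above about pulling specific lines out of a file, here is a minimal sketch in Python rather than sed. It assumes 1-based line numbers and reads the input once, stopping after the highest requested line. The helper name `pick_lines` and the sample data are hypothetical.

```python
def pick_lines(lines, wanted):
    """Yield (line_number, text) for each 1-based line number in wanted."""
    wanted = set(wanted)
    if not wanted:
        return
    last = max(wanted)
    for number, text in enumerate(lines, start=1):
        if number in wanted:
            yield number, text.rstrip("\n")
        if number >= last:
            break  # nothing left to find, so stop reading early


# Any iterable of lines works, including an open file object.
sample = ["alpha\n", "beta\n", "gamma\n", "delta\n", "epsilon\n"]
for number, text in pick_lines(sample, [1, 5]):
    print(number, text)
```

Because the wanted numbers are held in a set, the same loop handles a large list of line numbers without slowing down per line. With a real file, pass the file object from `open(...)` in place of `sample`.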
5
<div class="begin">...</div>
How can I match the HTML inside (and including) <div class="begin"> in PHP?
I need a regex solution that can handle the nested case.
Iambus asked 7/1, 2010 at 9:56
2
Solved
I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem wit...
Bellringer asked 4/12, 2009 at 17:28
6
Solved
What would be a good tool, or set of tools, to download a list of URLs and extract only the text content?
Spidering is not required, but control over the download file names, and threading would be...
Horned asked 12/1, 2009 at 14:22
2
Solved
My marketing department, bless them, has decided to make a sweepstakes where people enter over a webpage. That is great but the information isn't stored to a DB of any sort but is sent to an exchan...
Monteith asked 30/12, 2008 at 0:5