text-extraction Questions

2

Solved

I am trying to extract text from a PDF book and keep running into an issue where sections of copied text fail to retain their proper capitalization when pasted into a text document. I have r...

8

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, ...

1

Solved

This is my image: I used this link (Tesseract) to capture and process the image: http://kurup87.blogspot.com/2012/03/android-ocr-tutorial-image-to-text.html But this is the issue, if this entire...
Pittman asked 1/6, 2013 at 16:50

1

Solved

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), ho...
Agnomen asked 25/3, 2013 at 19:0

6

Solved

Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages, and I do it periodically. This works fine until the HTML of a certain webpage changes, this change ...
Vestibule asked 4/3, 2013 at 12:45

11

Solved

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove: any HTML tags, any JavaScript, any CSS styles. Is there a regular expression (one or mor...
Cogon asked 8/10, 2008 at 1:43
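
A minimal sketch of one way to approach this, not taken from the thread: instead of a regular expression, it swaps in Python's standard-library `html.parser` and skips anything inside `<script>` and `<style>`. The names `TextExtractor` and `html_to_text` are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and not pure whitespace.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: return the text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>"
    )
    print(html_to_text(page))
```

Fragments are joined with a single space, so a word split across inline tags would come out with a space inside it; whether that matters depends on the pages being processed.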

2

Solved

I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user Tweets to generate tags (keywords). And it seems like the...
Sabrasabre asked 4/5, 2010 at 9:20

8

Solved

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three,...
Intern asked 26/12, 2009 at 1:22

1

Solved

I'm using Tesseract but I don't know whether it neglects any nontext area and targets text only. Do I have to remove any nontext area as a preprocessing step for better output?
Protoplasm asked 17/4, 2012 at 15:5

3

I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . . Imagine some sentences along the...
Bacchanalia asked 8/12, 2011 at 16:26

4

How can I extract hyphenated strings from this string line? ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service I just want to extract "ADW-CFS-WE" from it but has been very unsuccess...
Finger asked 2/9, 2011 at 6:52
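
A small sketch of one possible approach, under the assumption that a "hyphenated string" means two or more alphanumeric runs joined by single hyphens. The pattern and the function name `hyphenated_tokens` are illustrative, not from the thread.

```python
import re

# Assumption: letters/digits, then one or more "-letters/digits" groups.
HYPHENATED = re.compile(r"\b[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+\b")


def hyphenated_tokens(line):
    """Return every hyphen-joined token found in `line`, in order."""
    return HYPHENATED.findall(line)


if __name__ == "__main__":
    line = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service"
    print(hyphenated_tokens(line))
```

Because the group is non-capturing, `findall` returns whole matches rather than only the last hyphenated piece.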

1

Solved

I am trying to pull the href from a URL in some data using PHP's DOMDocument. The following pulls the anchor for the URL, but I want the URL: $events[$i]['race_1'] = trim($cols->item(1)->nod...
Pyrrhonism asked 12/7, 2011 at 15:53

4

Solved

I would like to extract some data from a piece of text with Vim. The input looks like so: 72" title="(168,72)" onmouseover="posizione('(168,72)');" onmouseout="posizio...
Brimstone asked 3/7, 2011 at 18:44

1

Solved

Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text? I'd like to figure out a way of extracting links that are in the body of ...
Coparcener asked 3/1, 2011 at 23:20

3

Solved

I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate a webpage as a tree. To eliminate navigation...

4

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word m...
Romance asked 16/3, 2010 at 8:42

6

Solved

Using sed or similar how would you extract lines from a file? If I wanted lines 1, 5, 1010, 20503 from a file, how would I get these 4 lines? What if I have a fairly large number of lines I need t...
Precessional asked 6/1, 2010 at 23:6
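
The question asks about sed; as an alternative illustration only, here is a short Python sketch that streams a file once and keeps just the requested 1-based line numbers. The function name `extract_lines` and the sample data are hypothetical.

```python
import io


def extract_lines(lines, wanted):
    """Yield (number, text) for each 1-based line number listed in `wanted`.

    `lines` is any iterable of lines, such as an open file object.
    """
    remaining = set(wanted)
    if not remaining:
        return
    last = max(remaining)
    for number, text in enumerate(lines, start=1):
        if number in remaining:
            yield number, text.rstrip("\n")
        if number >= last:
            break  # nothing left to find, so stop reading early


if __name__ == "__main__":
    # Stand-in for a real file; with a file, pass the open handle instead.
    sample = io.StringIO("alpha\nbeta\ngamma\ndelta\nepsilon\n")
    for number, text in extract_lines(sample, [1, 5]):
        print(number, text)
```

Using a set for the wanted numbers keeps each lookup cheap even when the list of lines is long, and the early `break` avoids reading the rest of a large file.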

5

<div class="begin">...</div> How can I match the HTML inside (and including) <div class="begin"> in PHP? I need a regex solution that can handle the nested case.
Iambus asked 7/1, 2010 at 9:56

2

Solved

I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem wit...
Bellringer asked 4/12, 2009 at 17:28

6

Solved

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names, and threading would be...
Horned asked 12/1, 2009 at 14:22

2

Solved

My marketing department, bless them, has decided to make a sweepstakes where people enter over a webpage. That is great but the information isn't stored to a DB of any sort but is sent to an exchan...
Monteith asked 30/12, 2008 at 0:5
