text-extraction - 4

2

Solved

PDF text extraction issue - font/capitalization inconsistencies

I am trying to extract text from a PDF book and keep running into an issue where sections of copied text fail to retain their proper capitalization when pasted into a text document. I have r...

pdf character-encoding adobe-indesign text-extraction truetype

Castara asked 19/7, 2013 at 3:45

8

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, ...

java html screen-scraping html-content-extraction text-extraction

Undetermined asked 6/9, 2009 at 16:52

1

Solved

Extracting information from captured image in android

This is my image: I used this link (Tesseract) to capture and process the image: http://kurup87.blogspot.com/2012/03/android-ocr-tutorial-image-to-text.html But this is the issue: if this entire...

android image-processing text-extraction

Pittman asked 1/6, 2013 at 16:50

1

Solved

How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), ho...

pdf pdf-generation text-extraction

Agnomen asked 25/3, 2013 at 19:0

6

Solved

Extracting webpage information based on a template in Java

Right now I use Jsoup to periodically extract certain information (not all the text) from some third-party webpages. This works fine until the HTML of a certain webpage changes; this change ...

java text-extraction named-entity-extraction

Vestibule asked 4/3, 2013 at 12:45

11

Solved

regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove: any HTML tags, any JavaScript, any CSS styles. Is there a regular expression (one or mor...

html regex html-content-extraction text-extraction

Cogon asked 8/10, 2008 at 1:43

2

Solved

tag generation from a small text content (such as tweets)

I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user tweets to generate tags (keywords). And it seems like the...

twitter nlp text-extraction nltk text-analysis

Sabrasabre asked 4/5, 2010 at 9:20

8

Solved

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three,...

html html-content-extraction text-extraction

Intern asked 26/12, 2009 at 1:22

1

Solved

Does Tesseract neglect any nontext area in a scanned document?

I'm using Tesseract but I don't know whether it neglects any nontext area and targets text only. Do I have to remove any nontext area as a preprocessing step for better output?

image-processing ocr tesseract text-extraction

Protoplasm asked 17/4, 2012 at 15:5

3

Regular expression to match object dimensions

I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . . Imagine some sentences along the...

regex parsing text nlp text-extraction

Bacchanalia asked 8/12, 2011 at 16:26

4

Regular expression to match hyphenated words (kebab-case)

How can I extract hyphenated strings from this string line? ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service I just want to extract "ADW-CFS-WE" from it but have been very unsuccess...

php regex string text-extraction kebab-case

Finger asked 2/9, 2011 at 6:52
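
For illustration only, here is a minimal sketch of one way to pull hyphenated tokens out of the sample line quoted in the question above. It is written in Python rather than the PHP the question is tagged with, and the pattern and the helper name `find_hyphenated` are assumptions made for this example, not taken from the original thread. The pattern treats a token as hyphenated when it has two or more runs of letters or digits joined by single hyphens.

```python
import re

# Assumed pattern: runs of letters/digits joined by one hyphen each,
# with at least one hyphen present in the token.
KEBAB = re.compile(r"\b[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+\b")


def find_hyphenated(text: str) -> list[str]:
    """Return every hyphenated token in `text`, in order of appearance."""
    return KEBAB.findall(text)


line = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service"
print(find_hyphenated(line))  # → ['ADW-CFS-WE']
```

The same pattern should carry over to PHP's `preg_match_all` largely unchanged, though that has not been checked against the asker's real data.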

1

Solved

Get href attribute of <a> tag in HTML table cells

I am trying to pull the href from a URL in some data using PHP's DOMDocument. The following pulls the anchor for the URL, but I want the URL: $events[$i]['race_1'] = trim($cols->item(1)->nod...

php dom html-parsing domdocument text-extraction

Pyrrhonism asked 12/7, 2011 at 15:53

4

Solved

How to extract text matching a regex using Vim?

I would like to extract some data from a piece of text with Vim. The input looks like so: 72" title="(168,72)" onmouseover="posizione('(168,72)');" onmouseout="posizio...

vim text extract text-extraction

Brimstone asked 3/7, 2011 at 18:44

1

Solved

Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text?

Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text? I'd like to figure out a way of extracting links that are in the body of ...

python html-content-extraction text-extraction

Coparcener asked 3/1, 2011 at 23:20

3

Solved

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate the webpage as a tree. To eliminate navigation...

html artificial-intelligence nlp html-content-extraction text-extraction

Fluky asked 8/11, 2009 at 15:42

4

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word m...

nlp text-extraction nltk text-analysis

Romance asked 16/3, 2010 at 8:42

6

Solved

How do I extract lines from a file using their line number on unix?

Using sed or similar how would you extract lines from a file? If I wanted lines 1, 5, 1010, 20503 from a file, how would I get these 4 lines? What if I have a fairly large number of lines I need t...

unix sed awk line-numbers text-extraction

Precessional asked 6/1, 2010 at 23:6
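
As a rough sketch of the task described in the question above, the following Python generator yields the lines whose 1-based numbers appear in a requested set and stops reading once the highest requested line has been seen. The question asks about sed or similar tools; Python is used here only for illustration, and the function name `pick_lines` and the file name in the usage comment are hypothetical.

```python
from typing import Iterable, Iterator


def pick_lines(lines: Iterable[str], wanted: Iterable[int]) -> Iterator[str]:
    """Yield lines whose 1-based numbers are in `wanted`, in file order."""
    targets = set(wanted)
    if not targets:
        return
    last = max(targets)
    for number, line in enumerate(lines, start=1):
        if number in targets:
            yield line
        if number >= last:
            break  # nothing requested beyond this point, so stop reading


# Usage with a hypothetical file name:
# with open("data.txt") as handle:
#     for line in pick_lines(handle, [1, 5, 1010, 20503]):
#         print(line, end="")
```

Using a set keeps each membership check cheap, so the approach should also hold up when the list of requested line numbers is large.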

5

Extract a <div> element with a specified class and its contents from an HTML document

<div class="begin">...</div> How can I match the HTML inside (and including) <div class="begin"> in PHP? I need a regex solution that can handle nested cases.

php html regex html-parsing text-extraction

Iambus asked 7/1, 2010 at 9:56

2

Solved

PDF Parsing Using Python - extracting formatted and plain texts [closed]

I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem wit...

python pdf parsing text-extraction information-extraction

Bellringer asked 4/12, 2009 at 17:28

6

Solved

HTML downloading and text extraction

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names, and threading would be...

html linux text-extraction download

Horned asked 12/1, 2009 at 14:22

2

Solved

Extracting data from an email message (or several thousand emails) [Exchange based]

My marketing department, bless them, has decided to make a sweepstakes where people enter over a webpage. That is great but the information isn't stored to a DB of any sort but is sent to an exchan...

exchange-server text-extraction

Monteith asked 30/12, 2008 at 0:5
