text-extraction Questions

2

Solved

I want to extract text from pdf file using only Javascript in the client side without using the server. I've already found a javascript code in the following link: extract text from pdf in Ja...
Lydgate asked 2/7, 2013 at 11:39

6

Solved

I need to extract text from pdf files using iText. The problem is: some pdf files contain 2 columns and when I extract text I get a text file where columns are merged as the result (i.e. text from...
Saltus asked 26/10, 2010 at 21:37

3

Solved

I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine, and Readability python port. There are a number of those. early version by gfxmonk, bas...

2

Solved

I have streams of files being read from a directory and the filetree is of the form: /repository/resources/2016-03-04/file.csv /repository/resources/2016-03-04/file2.csv /repository/resources/2016...
Drummer asked 7/4, 2016 at 12:48

2

Solved

I have code that extracts text from a PDF using a filetotext class. It worked until last week, when something changed in the PDFs being generated. Weird thing is that it appears the characters are the...
Quicken asked 13/8, 2015 at 23:33

1

Solved

I am using iText to extract some text from a pdf file at a specific location. In order to do that I am using the LocationTextExtractionStrategy: public static void main(String[] args) throws Excep...
Mountaintop asked 11/2, 2016 at 16:37

3

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript. Can anyone help me out?
Renfroe asked 31/5, 2011 at 11:59

4

Solved

I need to extract text from a node like this: <div> Some text <b>with tags</b> might go here. <p>Also there are paragraphs</p> More text can go without paragraphs&...
Aleph asked 16/4, 2012 at 16:19

1

I want to extract the contents of a table in a PDF like this: I wrote this Java program using the iText Java PDF library, which can read the contents of a PDF file line by line, but I do n...
Philosophize asked 9/7, 2015 at 22:0

3

Solved

I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string "Applications started after 12:00 A.M. Midnight (EST) Fe...
Fascinate asked 17/2, 2010 at 0:34

5

Solved

I have a record of conversations between two arbitrary persons A and B. c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla" c2 <- "Person A: again blabla...
Doddered asked 23/4, 2015 at 8:34

1

I have a problem. I have a list of SKU numbers (hundreds) that I'm trying to match with the title of the product that it belongs to. I have thought of a few ways to accomplish this, but I feel lik...
Bewitch asked 25/3, 2015 at 22:37

2

I have to extract text from invoice and bill PDF files. The file layouts can get complex, though they are mostly filled with tables. I've read a few dozen articles already about the PDF format, how...
Scholar asked 17/4, 2012 at 10:5

15

Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will nee...
Monecious asked 6/9, 2010 at 11:11

4

Solved

Does anybody know a .net port for the boilerpipe library?
Camshaft asked 2/1, 2012 at 20:42

4

I need to extract some numbers from a text. Text is x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc...
Melbourne asked 24/8, 2014 at 18:20
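
The excerpt above is cut off, so the exact requirement is unknown. As a minimal sketch, assuming the goal is simply to collect every run of digits in order of appearance, a regular expression is enough. The question uses R; Python is used here only for illustration, and `extract_numbers` is a hypothetical helper name.

```python
import re


def extract_numbers(text: str) -> list[int]:
    """Return every run of digits in `text` as an int, in order of appearance."""
    return [int(match) for match in re.findall(r"\d+", text)]


# Only the part of the sample string that is visible in the excerpt.
sample = (
    "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). "
    "Deinde prima illa, quae in congressu[232]."
)
numbers = extract_numbers(sample)
print(numbers)
```

If only the bracketed or only the parenthesised numbers are wanted, the pattern would need to anchor on `[`...`]` or `(`...`)` instead of matching bare digits.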

6

Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx? I've found this - wondered if there were any other suggestions?
Tompion asked 15/4, 2011 at 3:12
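
For the .docx half of the question, one option that needs no third-party library is to read the file as a ZIP archive: a .docx stores its body in `word/document.xml`, with visible text inside `w:t` elements grouped under `w:p` paragraphs. The sketch below assumes a well-formed file and ignores headers, footers, footnotes and the legacy binary .doc format; `docx_to_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by the w:p and w:t elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source) -> str:
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (the file name is only an example) returns the paragraphs joined by newlines. Formatting, tables-as-tables and embedded objects are not preserved.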

6

Solved

I am trying to use boilerpipe java library, to extract news articles from a set of websites. It works great for texts in english, but for text with special characters, for example, words with acce...
Crescen asked 13/2, 2012 at 11:51

4

Solved

I have text files with repeated exact lines of text, but I only want one of each. Imagine this text file: AAAAA AAAAA AAAAA BB BBBBB BBBBB CCC CCC CCC I would only need the following four lines ...
Lais asked 14/7, 2014 at 10:46
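
A minimal sketch of one way to do this, assuming each repeated item sits on its own line (the excerpt shows them flattened onto one line) and that the first occurrence of each line should be kept in its original order. `unique_lines` is a hypothetical helper name.

```python
def unique_lines(lines):
    """Return `lines` with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


# Sample from the question, assumed to be one item per line.
sample = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC"
deduped = unique_lines(sample.splitlines())
print("\n".join(deduped))
```

The set makes each membership check constant time on average, so the whole pass is linear in the number of lines; the separate list is what preserves the original order.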

2

Solved

I need a clever regex to match ... in these: <img src="..." <img src='...' <img src=... I want to match the inner content of src, but only if it is surrounded by ", ' or none. This mean...
Spasmodic asked 28/10, 2010 at 22:26
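
The excerpt is truncated, so the full constraints are unknown. As a sketch under the assumption that the value must be wrapped in matching double quotes, matching single quotes, or no quotes at all, one alternation per quoting style works. `IMG_SRC` and `img_sources` are hypothetical names, and a real HTML parser is usually more robust than a regex for this job.

```python
import re

# One capture group per quoting style: "..." | '...' | unquoted.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)


def img_sources(html: str) -> list[str]:
    """Return the src value of every <img> tag matched by IMG_SRC."""
    sources = []
    for match in IMG_SRC.finditer(html):
        # Exactly one of the three groups participates in each match.
        sources.append(next(g for g in match.groups() if g is not None))
    return sources


html = """<img src="a.png"> <img src='b.png'> <img src=c.png>"""
print(img_sources(html))
```

A value with mismatched quotes, such as `src="a.png'`, is not matched by any of the three alternatives and is skipped.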

1

Solved

I am stuck with a regular expression. $matches = array(); // $controller = $this->getRequest()->attributes->get('_controller'); $controller = "Acme\MyBundle\Controller\MyControl...
Schulze asked 4/4, 2014 at 4:11

1

I am looking to extract text with its font details (style, size, color, italic, etc.) from a PDF in Python. I need to extract text and its metadata for translation purposes. Can anyone suggest any l...
Chlamydate asked 21/2, 2014 at 6:20

13

Solved

Is there any python module to convert PDF files into text? I tried one piece of code found in Activestate which uses pypdf but the text generated had no space between and was of no use.
Brotherson asked 25/8, 2008 at 4:44

1

iTextSharp 4.1.6 is the last version licensed under LGPL and is free to use for commercial purposes without paying license fees. It might be interesting for some and for me, how to extract text with...
Consecrate asked 13/4, 2012 at 14:50

2

Solved

I'm trying to get my way through Poppler and its (lack of) documentation. What I want to do is a very simple thing: open a PDF file and read the text in it. I'm then going to process the text, but...
Floreneflorentia asked 28/4, 2010 at 18:31
