text-extraction Questions

2

Solved

We are using PDFBox to extract text from PDFs. The text of some PDFs can't be extracted correctly. The following image shows part of the PDF: After text extraction we get the following te...
Camel asked 10/4, 2015 at 6:1

2

I am working on reading the highlighted text from a PDF document using PDFBox. I was able to read highlighted text on a single line, for both single and multiple words. However, I could not read the highlight...
Arlinda asked 16/9, 2015 at 12:3

0

I'm trying to set up Tika for text extraction using Python. I've installed the Java runtime (JRE 1.8.0), installed tika with pip install tika==1.23, downloaded the tika server jar file from this link, and...
Pessa asked 2/3, 2021 at 18:36

4

Solved

I want to detect text areas in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input is text only, but when the input image contains non-text content it fails,...
Rakia asked 18/4, 2012 at 9:25

14

Solved

I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want. How can I write a regex to extract "the data i want" from the following text? m...
Osiris asked 11/1, 2011 at 20:22
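
A minimal sketch of one way to approach the question above, assuming Python (the excerpt does not name a language) and assuming the wanted data contains no single quotes itself. The sample string and the helper name `between_single_quotes` are hypothetical, since the asker's own text is cut off in the excerpt.

```python
import re

def between_single_quotes(text):
    """Return the text between the first pair of single quotes, or None."""
    # '([^']*)' captures any run of non-quote characters between two quotes.
    match = re.search(r"'([^']*)'", text)
    return match.group(1) if match else None

# Hypothetical input: the original sample text is truncated in the excerpt.
sample = "prefix 'the data i want' suffix"
print(between_single_quotes(sample))
```

The negated character class `[^']*` stops at the first closing quote, which avoids the over-matching a greedy `.*` would cause when a line holds several quoted parts.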

3

I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with headings Introduction, Summary, Contents. I need to extract only the text under the heading 'Su...
Albanese asked 5/1, 2018 at 5:19

13

Solved

I have a URL and I need to get the value of v from this URL. Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE How can I do that?
Kindig asked 31/7, 2012 at 5:14
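
One possible approach to the question above, sketched in Python (the excerpt does not name a language, so this is an assumption): parse the URL with the standard library's `urllib.parse` and read the `v` query parameter. The helper name `video_id` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    # parse_qs maps each parameter name to a list of its values.
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None

print(video_id("http://www.youtube.com/watch?v=_RCIP6OrQrE"))
```

Parsing the query string tends to be more robust than a regex here, because it still works when `v` is not the first parameter.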

2

I have data in a structured table image. The data is like below: I tried to extract the text from this image using this code: import pytesseract from PIL import Image value=Image.open("d...
Residence asked 17/12, 2019 at 8:55

6

Solved

I want to change the displayed username from an email address such as abcd@... to only abcd. For this, I should clip the part starting from @. I can do this very easily through variablename.substring() functi...
Chymotrypsin asked 22/4, 2011 at 5:40

5

Solved

Consider the following example: case Foo: ... break; case Bar: ... break; case More: case Complex: ... break: ... Say, we would like to retrieve all matches of the regex case \([^:]*\): (the...
Alcantar asked 31/1, 2012 at 12:33

1

I am using the following code in Python: I am getting the following key values in the dictionary: 'block_num', 'conf', 'level', 'line_num', 'page_num', 'par_num', 'text', 'top', 'width', 'word_num', '...
Veronica asked 21/6, 2019 at 7:38

6

Solved

I am looking to get the filename from the end of a filepath string, say $text = "bob/hello/myfile.zip"; I want to be able to obtain the file name, which I guess would involve getting eve...
Jehanna asked 30/6, 2010 at 14:2
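
The question above appears to use a different language (note the `$text` variable), so the following is only a hedged Python illustration of the same idea: take everything after the last forward slash. The helper name `file_name` is hypothetical, and `posixpath` is used on the assumption that the path always uses `/` as its separator.

```python
import posixpath

def file_name(path):
    """Return the final component of a '/'-separated path."""
    # basename returns everything after the last '/', or the whole
    # string when it contains no '/'.
    return posixpath.basename(path)

print(file_name("bob/hello/myfile.zip"))
```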

0

I am trying to convert all readable text in a PDF file into a string using textract. It works for most of the files, but for some it gives a UnicodeDecodeError. I want to skip problematic characters. I hav...
Alible asked 25/10, 2019 at 11:51

10

Solved

I was trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried to use NPOI bu...
Mabellemable asked 18/6, 2009 at 7:20

4

Solved

I am trying to use the stringr package to extract the part of a string that lies between two particular patterns. For example, I have: my.string <- "nanaqwertybaba" left.border <- "nana" right.border <...
Fredia asked 7/4, 2014 at 22:21

4

Solved

I am trying to compare two sentences and see if they contain the same set of words. E.g., comparing "today is a good day" and "is today a good day" should return true. I am using the Counter function...
Willner asked 25/6, 2019 at 20:22
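
A minimal sketch of the comparison described above, assuming words are separated by whitespace and that case and punctuation need no normalization (the excerpt is cut off before the asker's own code, so this is not their implementation). The helper name `same_words` is hypothetical.

```python
from collections import Counter

def same_words(first, second):
    """True if both sentences contain the same words the same number of times."""
    # Counter builds a word -> count mapping, so word order is ignored.
    return Counter(first.split()) == Counter(second.split())

print(same_words("today is a good day", "is today a good day"))
```

`Counter` compares multisets, so a repeated word has to appear equally often in both sentences. Comparing `set(first.split())` instead would ignore repetition, which may or may not be what "same set of words" is meant to cover.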

2

When I run the below Python script on a directory that contains a PDF file, I keep getting this error: ShellError: The command pdftotext "path/to/pdf/title.pdf" - failed with exit code 1 ------...
Bianco asked 8/4, 2015 at 17:1

2

Solved

I couldn't install textract in Google Colab; the error message is shown below. Some people suggest using sudo apt-get install libasound2-dev, but how do I run sudo... in Google Colab? === error mess...
Wester asked 10/1, 2019 at 6:10

0

I'm using "pdftotext -bbox file.pdf" to convert a PDF file into HTML. Here's a sample line from the output: <word xMin="351.852025" yMin="42.548936" xMax="365.689478" yMax="47.681498">foo<...
Elegit asked 6/5, 2018 at 11:23

3

I have extracted an image document with Tesseract, and the extraction succeeded, but I am not able to understand the coordinates of the extracted document. Problem description: it is showing coordinates ...
Incommodious asked 31/8, 2013 at 16:38

1

Solved

I am working on a project where I need to extract "technology related keywords/keyphrases" from text. For example, my text is: "ABC Inc has been working on a project related to machine lear...
Herald asked 13/3, 2018 at 18:28

2

Solved

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example...

4

Solved

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's...
Alongshore asked 19/4, 2011 at 14:22

8

Solved

I have a file that looks something like this: <table name="content_analyzer" primary-key="id"> <type="global" /> </table> <table name="content_analyzer2" primary-key...
Toilet asked 22/2, 2011 at 16:34

5

Solved

I'm trying to write regex that extracts all hex colors from CSS code. This is what I have now: Code: $css = <<<CSS /* Do not match me: #abcdefgh; I am longer than needed. */ .foo { c...
Dialectical asked 11/10, 2012 at 10:54
