text-extraction - 2

2

Solved

Apache PDFBox Remove Spaces between characters

We are using PDFBox to extract text from PDFs. The text of some PDFs can't be extracted correctly. The following image shows part of the PDF as an image: After text extraction we get the following te...

pdfbox text-extraction pdf-parsing

Camel asked 10/4, 2015 at 6:1

2

Not able to read the exact text highlighted across the lines

I am working on reading highlighted text from a PDF document using PDFBox. I was able to read highlighted text on a single line, for both single and multiple words. However, I could not read the highlight...

java pdf pdfbox text-extraction

Arlinda asked 16/9, 2015 at 12:3

0

Tika server returned status: 404

I'm trying to set up Tika for text extraction using Python. I've installed Java runtime JRE 1.8.0, installed tika with pip install tika==1.23, downloaded the Tika server jar file from this link, and...

java python apache-tika text-extraction tika-server

Pessa asked 2/3, 2021 at 18:36

4

Solved

How to detect Text Area from image?

I want to detect the text area in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input is text only, but when the input image contains non-text content it fails,...

c++ image-processing tesseract text-extraction

Rakia asked 18/4, 2012 at 9:25

14

Solved

How to extract a substring using regex

I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want. How can I write a regex to extract "the data i want" from the following text? m...

java regex string text-extraction

Osiris asked 11/1, 2011 at 20:22

3

How to extract text under specific headings from a pdf?

I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with the headings Introduction, Summary, Contents. I need to extract only the text under the heading 'Su...

python-2.7 pdf document text-extraction pdf-extraction

Albanese asked 5/1, 2018 at 5:19

13

Solved

Getting URL parameter in java and extract a specific text from that URL

I have a URL and I need to get the value of v from this URL. Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE How can I do that?

java url text-extraction

Kindig asked 31/7, 2012 at 5:14

2

How to extract text from table in image?

I have data in a structured table image. The data is like below: I tried to extract the text from this image using this code: import pytesseract from PIL import Image value=Image.open("d...

python ocr tesseract text-extraction python-tesseract

Residence asked 17/12, 2019 at 8:55

6

Solved

How to get a substring from string through PHP?

I want to change the displayed username from [email protected] to only abcd. For this, I should clip the part starting from @. I can do this very easily through variablename.substring() functi...

php substring text-extraction

Chymotrypsin asked 22/4, 2011 at 5:40

5

Solved

How to extract all regex matches in a file using Vim?

Consider the following example: case Foo: ... break; case Bar: ... break; case More: case Complex: ... break: ... Say, we would like to retrieve all matches of the regex case \([^:]*\): (the...

regex vim match text-extraction

Alcantar asked 31/1, 2012 at 12:33

1

What do the key values of the dictionary output of the following code in tesseract signify?

I am using the following code in Python: I am getting the following key values in the dictionary: 'block_num', 'conf', 'level', 'line_num', 'page_num', 'par_num', 'text', 'top', 'width', 'word_num', '...

python-3.x tesseract text-extraction python-tesseract

Veronica asked 21/6, 2019 at 7:38

6

Solved

Extract filename with extension from filepath string

I am looking to get the filename from the end of a filepath string, say $text = "bob/hello/myfile.zip"; I want to be able to obtain the file name, which I guess would involve getting eve...

php substring filenames filepath text-extraction

Jehanna asked 30/6, 2010 at 14:2

0

How to skip the character causing UnicodeDecodeError: using textract like errors="replace"?

I am trying to convert all readable text in a PDF file into a string using textract. It works for most of the files, but for some it gives a UnicodeDecodeError. I want to skip problematic characters. I hav...

python pdf text-extraction

Alible asked 25/10, 2019 at 11:51

10

Solved

How to extract text from MS office documents in C#

I was trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried to use NPOI bu...

c# ms-office text-extraction

Mabellemable asked 18/6, 2009 at 7:20

4

Solved

Extract part of string between two different patterns

I am trying to use the stringr package to extract the part of a string that lies between two particular patterns. For example, I have: my.string <- "nanaqwertybaba" left.border <- "nana" right.border &l...

regex r text-extraction stringr

Fredia asked 7/4, 2014 at 22:21

4

Solved

Check if two strings contain the same set of words in Python

I am trying to compare two sentences and see if they contain the same set of words. E.g., comparing "today is a good day" and "is today a good day" should return true. I am using the Counter function...

python python-2.7 text text-extraction

Willner asked 25/6, 2019 at 20:22
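
For the word-set comparison asked about above, here is a minimal Python sketch. It assumes words are separated by whitespace and compared case-sensitively; the truncated question does not confirm either point. The helper names `same_word_set` and `same_word_counts` are hypothetical. The first ignores how often each word occurs; the second uses `collections.Counter`, which the question mentions, and also requires matching counts.

```python
from collections import Counter


def same_word_set(a: str, b: str) -> bool:
    """Return True if both strings contain the same distinct words."""
    # Assumption: words are whitespace-separated and case-sensitive.
    return set(a.split()) == set(b.split())


def same_word_counts(a: str, b: str) -> bool:
    """Return True if both strings contain the same words with the same counts."""
    return Counter(a.split()) == Counter(b.split())


if __name__ == "__main__":
    print(same_word_set("today is a good day", "is today a good day"))
    print(same_word_counts("today is a good day", "is today a good day"))
    print(same_word_set("good day", "good good day"))
    print(same_word_counts("good day", "good good day"))
```

Which variant fits depends on whether repeated words should matter: "good day" and "good good day" are equal as sets but differ as counts.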

2

Python pdftotext ShellError Using textract

When I run the below Python script on a directory that contains a PDF file, I keep getting this error: ShellError: The command pdftotext "path/to/pdf/title.pdf" - failed with exit code 1 ------...

python pdf text-extraction

Bianco asked 8/4, 2015 at 17:1

2

Solved

Couldn't install textract in google colab

I couldn't install textract in Google Colab; the error message is shown below. Some people suggest using sudo apt-get install libasound2-dev, but how do I run sudo... in Google Colab? === error mess...

python google-colaboratory text-extraction

Wester asked 10/1, 2019 at 6:10

0

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: <word xMin="351.852025" yMin="42.548936" xMax="365.689478" yMax="47.681498">foo&l...

text-extraction pdftotext poppler pdf-scraping xpdf

Elegit asked 6/5, 2018 at 11:23

3

Not able to understand coordinate in extracted document using OCR engine tesseract

I have extracted text from an image document with Tesseract, and the extraction was successful. But I am not able to understand the coordinates in the extracted document. Problem description: - It is showing coordinates ...

ocr tesseract text-extraction hocr

Incommodious asked 31/8, 2013 at 16:38

1

Solved

Keyword/keyphrase extraction from text [closed]

I am working on a project where I need to extract "technology related keywords/keyphrases" from text. For example, my text is: "ABC Inc has been working on a project related to machine lear...

machine-learning nlp text-mining jnlp text-extraction

Herald asked 13/3, 2018 at 18:28

2

Solved

List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example...

python machine-learning scikit-learn text-extraction countvectorizer

Ssw asked 18/4, 2013 at 8:27

4

Solved

Extracting whole words

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's...

python regex cpu-word alphabetical text-extraction

Alongshore asked 19/4, 2011 at 14:22

8

Solved

How to extract string following a pattern with grep, regex or perl [duplicate]

I have a file that looks something like this: <table name="content_analyzer" primary-key="id"> <type="global" /> </table> <table name="content_analyzer2" primary-key...

regex perl sed html-parsing text-extraction

Toilet asked 22/2, 2011 at 16:34

5

Solved

Extract all hex colors from a multiline CSS string

I'm trying to write regex that extracts all hex colors from CSS code. This is what I have now: Code: $css = <<<CSS /* Do not match me: #abcdefgh; I am longer than needed. */ .foo { c...

php css colors hex text-extraction

Dialectical asked 11/10, 2012 at 10:54

