text-extraction Questions
2
Solved
We are using PDFBox to extract text from PDFs.
The text of some PDFs can't be extracted correctly.
The following image shows part of the PDF as an image:
After text extraction we get the following te...
Camel asked 10/4, 2015 at 6:1
2
I am working on reading the highlighted text from a PDF document using PDFBox. I was able to read highlighted text on a single line, for both single and multiple words. However, I could not read the highlight...
Arlinda asked 16/9, 2015 at 12:3
0
I'm trying to set up Tika for text extraction using Python. I've installed the Java runtime (JRE 1.8.0), installed tika with pip install tika==1.23, downloaded the Tika server jar file from this link, and...
Pessa asked 2/3, 2021 at 18:36
4
Solved
I want to detect text areas in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input is text only, but when the input image contains non-text content it fails,...
Rakia asked 18/4, 2012 at 9:25
14
Solved
I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want.
How can I write a regex to extract "the data i want" from the following text?
m...
Osiris asked 11/1, 2011 at 20:22
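One possible approach to the question above, as a minimal sketch in Python. The excerpt names no language and its sample string is truncated, so both the language and the input below are assumptions. A negated character class captures everything between the first pair of single quotes.

```python
import re

def between_single_quotes(text):
    """Return the text between the first pair of single quotes, or None."""
    # [^']* matches any run of characters that are not a single quote.
    match = re.search(r"'([^']*)'", text)
    return match.group(1) if match else None

# Hypothetical sample input; the original example string is truncated.
sample = "some prefix 'the data i want' some suffix"
print(between_single_quotes(sample))
```

If the string could contain more than one quoted segment, `re.findall` with the same pattern would return all of them.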
3
I want to extract text under specific headings from a pdf using python.
For example, I have a pdf with headings Introduction,Summary,Contents. I need to extract only the text under the heading 'Su...
Albanese asked 5/1, 2018 at 5:19
13
Solved
I have a URL and I need to get the value of v from this URL.
Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE
How can I do that?
Kindig asked 31/7, 2012 at 5:14
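A hedged sketch of one way to read the `v` parameter, assuming Python (the excerpt does not state a language). It relies on the standard library's `urllib.parse` rather than a hand-written regex; the helper name `get_query_value` is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def get_query_value(url, key):
    """Return the first value of a query-string parameter, or None if absent."""
    query = urlparse(url).query      # "v=_RCIP6OrQrE"
    values = parse_qs(query).get(key)
    return values[0] if values else None

print(get_query_value("http://www.youtube.com/watch?v=_RCIP6OrQrE", "v"))
# prints _RCIP6OrQrE
```

Parsing the query string this way keeps working when other parameters appear before or after `v`.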
2
I have data in a structured table image. The data looks like this:
I tried to extract the text from this image using this code:
import pytesseract
from PIL import Image
value=Image.open("d...
Residence asked 17/12, 2019 at 8:55
6
Solved
I want to change the displayed username from [email protected] to only abcd.
For this, I should clip the part starting from @.
I can do this very easily through variablename.substring() functi...
Chymotrypsin asked 22/4, 2011 at 5:40
5
Solved
Consider the following example:
case Foo:
...
break;
case Bar:
...
break;
case More: case Complex:
...
break;
...
Say, we would like to retrieve all matches of the regex case \([^:]*\): (the...
Alcantar asked 31/1, 2012 at 12:33
1
I am using the following code in python:
I am getting the following key values in the dictionary:
'block_num', 'conf', 'level', 'line_num', 'page_num', 'par_num', 'text', 'top', 'width', 'word_num', '...
Veronica asked 21/6, 2019 at 7:38
6
Solved
I am looking to get the filename from the end of a filepath string, say
$text = "bob/hello/myfile.zip";
I want to be able to obtain the file name, which I guess would involve getting eve...
Jehanna asked 30/6, 2010 at 14:2
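The question above is written in PHP. As a language-neutral illustration of the idea (take everything after the last slash), here is a minimal sketch in Python. The function name `file_name` is hypothetical.

```python
def file_name(path):
    """Return the part of a slash-separated path after the last '/'."""
    # rsplit with maxsplit=1 splits once, from the right;
    # if there is no '/', the whole string is returned.
    return path.rsplit("/", 1)[-1]

print(file_name("bob/hello/myfile.zip"))
# prints myfile.zip
```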
0
I am trying to convert all readable text in a PDF file into a string using textract. It works for most files, but for some it raises a UnicodeDecodeError. I want to skip the problematic characters.
I hav...
Alible asked 25/10, 2019 at 11:51
10
Solved
I was trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents?
I tried to use NPOI bu...
Mabellemable asked 18/6, 2009 at 7:20
4
Solved
I am trying to use the stringr package to extract the part of a string that lies between two particular patterns.
For example, I have:
my.string <- "nanaqwertybaba"
left.border <- "nana"
right.border &l...
Fredia asked 7/4, 2014 at 22:21
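The question above concerns R's stringr. The same idea can be sketched in Python as a rough illustration only. The right border value is cut off in the excerpt, so `"baba"` below is an assumption, and the helper name `between` is hypothetical.

```python
import re

def between(text, left, right):
    """Return the shortest text found between the left and right borders, or None."""
    # re.escape makes the borders literal; (.*?) is a lazy capture group.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

# "baba" is an assumed right border; the original value is truncated.
print(between("nanaqwertybaba", "nana", "baba"))
# prints qwerty
```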
4
Solved
I am trying to compare two sentences and see if they contain the same set of words.
E.g., comparing "today is a good day" and "is today a good day" should return true.
I am using the Counter function...
Willner asked 25/6, 2019 at 20:22
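A minimal sketch of the comparison described above, assuming whitespace-separated words and case-insensitive matching (neither is stated in the excerpt). Note that `Counter` compares word counts, so a repeated word matters; comparing `set(...)` values instead would ignore repeats.

```python
from collections import Counter

def same_words(a, b):
    """True if both sentences contain the same words with the same counts."""
    return Counter(a.lower().split()) == Counter(b.lower().split())

print(same_words("today is a good day", "is today a good day"))
# prints True
```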
2
When I run the below Python script on a directory that contains a PDF file, I keep getting this error:
ShellError: The command pdftotext "path/to/pdf/title.pdf" - failed with exit code 1
------...
Bianco asked 8/4, 2015 at 17:1
2
Solved
I couldn't install textract in Google Colab; the error message is shown below.
Some people suggest using sudo apt-get install libasound2-dev, but how do I run sudo... in Google Colab?
=== error mess...
Wester asked 10/1, 2019 at 6:10
0
I'm using "pdftotext -bbox file.pdf" to convert a PDF file into HTML.
Here's a sample line from the output:
<word xMin="351.852025" yMin="42.548936" xMax="365.689478"
yMax="47.681498">foo&l...
Elegit asked 6/5, 2018 at 11:23
3
I have run Tesseract on an image document and the extraction succeeded, but I am not able to understand the coordinates in the extracted output.
Problem description:
It is showing coordinates ...
Incommodious asked 31/8, 2013 at 16:38
1
Solved
I am working on a project where I need to extract "technology related keywords/keyphrases" from text. For example, my text is:
"ABC Inc has been working on a project related to machine lear...
Herald asked 13/3, 2018 at 18:28
2
Solved
I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example...
Ssw asked 18/4, 2013 at 8:27
4
Solved
I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's...
Alongshore asked 19/4, 2011 at 14:22
8
Solved
I have a file that looks something like this:
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key...
Toilet asked 22/2, 2011 at 16:34
5
Solved
I'm trying to write a regex that extracts all hex colors from CSS code.
This is what I have now:
Code:
$css = <<<CSS
/* Do not match me: #abcdefgh; I am longer than needed. */
.foo
{
c...
Dialectical asked 11/10, 2012 at 10:54