apache-tika Questions

4

Solved

I'd need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents like odt, ppt, pptx, xlsx, etc. If you look at mimetypes.xml there are mimeType elem...
Medor asked 21/8, 2011 at 10:14

3

On Tika's website it says (concerning tika-app-1.2.jar) it can be used in server mode. Does anyone know how to send documents and receive parsed text from this server once it is running?
Herzberg asked 1/9, 2012 at 21:39

1

Solved

I have an HTML file: <html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"> <div>Test message.</div> <div> </div>...
Stint asked 4/7, 2013 at 17:26

2

Solved

I downloaded the tika-core and tika-parser libraries, but I could not find example code to parse HTML documents to a string. I have to get rid of all the HTML tags in the source of a web page. What can I do? ...
Mcauley asked 25/3, 2011 at 7:47

1

I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of m...
Epicycle asked 30/7, 2014 at 17:57

1

Solved

I want to use Apache Tika's MediaType class to compare mediaTypes. I first use Tika to detect the MediaType. Then I want to start an action according to the MediaType. So if the MediaType is from...
Lahdidah asked 20/4, 2014 at 6:51

0

Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks. Any idea (not only on Tika) to extract PDF text while converting character ligatures to...
Overissue asked 12/3, 2014 at 10:30

2

Solved

I have the following workflow in my (web)application: download a PDF file from an archive, index the file, delete the file. My problem is that after indexing the file, it remains locked and the de...
Alemannic asked 26/2, 2014 at 8:33

2

Solved

I had a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested I use Tika for this task. While going through Tika, I came across the POI API and found it more friendly to use...
Inception asked 19/9, 2013 at 6:47

2

Is Apache Tika able to extract foreign languages like Chinese, Japanese? I have the following code: Detector detector = new DefaultDetector(); Parser parser = new AutoDetectParser(detector); ...
Cretic asked 26/3, 2013 at 13:58

1

I added an external jar in my Eclipse dynamic web project via Folder -> properties -> build path -> Libraries -> add external jar. The code is working fine at compile time. package servlet; impor...
Ban asked 4/11, 2012 at 10:8

3

Solved

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some ob...
Twylatwyman asked 28/4, 2011 at 20:53

1

I am trying to search Arabic PDFs in Apache Solr. The problem appears to be that Tika indexes the PDF in reverse order (left-to-right) instead of (right-to-left). I have found references about th...
Herein asked 27/11, 2012 at 17:27

3

Solved

I am trying to use TikaEntityProcessor to index the .html file content. Somehow I am not able to get it correctly. I have checked the error log and I got the following error. SEVERE: Full Import f...
Pandean asked 11/2, 2013 at 15:55

2

Solved

When I try to extract text from my PDF files, it seems to insert white spaces between several words randomly. I am using pdfbox-app-1.6.0.jar (latest version) on the following sample file in Downloads ...
Periclean asked 31/10, 2011 at 14:6

1

I originally tried posting a similar post to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses so I...
Cork asked 2/10, 2012 at 0:25

2

Solved

How can I detect images in a document, say doc, xls, ppt or pdf? I came across Apache Tika and am trying its command line option. http://tika.apache.org/1.2/gettingstarted.html But not quite s...
Brindisi asked 13/8, 2012 at 10:45

1

Solved

I'm interested in Spring & Apache Tika integration. Is this approach thread-safe? <bean id="tika" class="org.apache.tika.Tika"/> Can I safely call the detect() method from different thread...
Manhole asked 17/4, 2012 at 12:11

2

Solved

I'm just getting started with elasticsearch. Our requirement has us needing to index thousands of PDF files and I'm having a hard time getting just ONE of them to index successfully. Installed the...
Zahara asked 13/6, 2012 at 14:50

1

I've installed Tika with Solr, and it's working well for Arabic PDF. Is there any tutorial to make this happen? I've seen a similar question to this and the solution was to include ICU4J.jar, b...
Scathing asked 9/4, 2012 at 17:14

4

Solved

Is it possible to extract text from URLs with Tika? Any links will be appreciated. Or is Tika usable only for PDF, Word and other media documents?
Loidaloin asked 11/7, 2011 at 21:30

1

Solved

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit. LanguageIdentifier identifier = new LanguageIdentifier("فارسی"); String language = identifier.getLanguage(...
Dissimilar asked 28/1, 2012 at 11:30

3

I'm running Solr 1.4 on Ubuntu 10.04 (installed via apt-get solr-tomcat) and it seems to be working fine. I'm having some difficulty finding any coherent info on how to index documents though. I'm new...
Robbyrobbyn asked 10/5, 2010 at 10:48

1

Solved

Is it possible to index rich documents (PDF, Office, ...) with the data import handler using Solr Cell? I use Solr 3.2. Thanks.
Sherylsheryle asked 13/7, 2011 at 8:38

1

I am trying to index using a curl-based request. The request is curl "http://localhost:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&fmap.content=attr_content&commit=true" -...
Hertel asked 31/5, 2011 at 11:28
