apache-tika - 3

4

Solved

Getting MimeType subtype with Apache tika

I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents like odt, ppt, pptx, xlsx, etc. If you look at mimetypes.xml there are mimeType elem...

java mime-types detection apache-tika

Medor asked 21/8, 2011 at 10:14

3

How to use Tika in server mode

On Tika's website it says (concerning tika-app-1.2.jar) it can be used in server mode. Does anyone know how to send documents and receive parsed text from this server once it is running?

apache-tika

Herzberg asked 1/9, 2012 at 21:39

1

Solved

Apache tika: remove extra line breaks in result string

I have an HTML file: <html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"> <div>Test message.</div> <div>&nbsp;</div>...

java apache-tika

Stint asked 4/7, 2013 at 17:26

2

Solved

How can I use the HTML parser with Apache Tika in Java to extract all HTML tags?

I downloaded the tika-core and tika-parser libraries, but I could not find example code for parsing HTML documents to a string. I have to strip all HTML tags from the source of a web page. What can I do? ...

java html apache apache-tika

Mcauley asked 25/3, 2011 at 7:47

1

Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of m...

java ms-office apache-poi pdfbox apache-tika

Epicycle asked 30/7, 2014 at 17:57

1

Solved

Correct use of Apache Tika MediaType

I want to use Apache Tika's MediaType class to compare media types. I first use Tika to detect the MediaType. Then I want to start an action according to the MediaType. So if the MediaType is from...

content-type apache-tika media-type

Lahdidah asked 20/4, 2014 at 6:51

0

Handle ligatures in Apache Tika

Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks. Any ideas (not only with Tika) on how to extract PDF text while converting character ligatures to...

java pdf character-encoding apache-tika ligature

Overissue asked 12/3, 2014 at 10:30

2

Solved

Files locked after indexing

I have the following workflow in my (web) application: download a PDF file from an archive, index the file, delete the file. My problem is that after indexing the file, it remains locked and the de...

solrj solr4 apache-tika

Alemannic asked 26/2, 2014 at 8:33

2

Solved

Difference between the Apache POI API and the Apache Tika API?

I had a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task. While going through Tika, I came across the POI API and found it more friendly to use...

java apache-poi apache-tika

Inception asked 19/9, 2013 at 6:47

2

Is Apache Tika able to extract foreign languages like Chinese, Japanese?

Is Apache Tika able to extract foreign languages like Chinese, Japanese? I have the following code: Detector detector = new DefaultDetector(); Parser parser = new AutoDetectParser(detector); ...

apache apache-tika

Cretic asked 26/3, 2013 at 13:58

1

Eclipse Juno EE NoClassDefFoundError when using external Jar

I added an external jar to my Eclipse dynamic web project via Folder -> properties -> build path -> Libraries -> add external jar. The code works fine at compile time. package servlet; impor...

apache jakarta-ee eclipse-juno apache-tika

Ban asked 4/11, 2012 at 10:8

3

Solved

Is it possible to extract text by page for word/pdf files using Apache Tika?

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some ob...

text apache-tika

Twylatwyman asked 28/4, 2011 at 20:53

1

Solr for Arabic PDFs

I am trying to search Arabic PDFs in Apache Solr. The problem appears to be that Tika indexes the PDF in reverse order (left-to-right) instead of (right-to-left). I have found references about th...

drupal solr arabic right-to-left apache-tika

Herein asked 27/11, 2012 at 17:27

3

Solved

Unable to configure Tika 1.2 with Solr 4

I am trying to use TikaEntityProcessor to index the content of an .html file, but I cannot get it to work. I checked the error log and found the following error. SEVERE: Full Import f...

solr apache-tika dataimporthandler solr4

Pandean asked 11/2, 2013 at 15:55

2

Solved

PDFBox adding white spaces within words

When I try to extract text from my PDF files, it seems to insert white spaces between several words randomly. I am using pdfbox-app-1.6.0.jar (latest version) on the following sample file in Downloads ...

solr lucene pdfbox apache-tika

Periclean asked 31/10, 2011 at 14:6

1

Customising the search algorithm of Elasticsearch

I originally tried posting a similar post to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses so I...

java lucene elasticsearch apache-tika

Cork asked 2/10, 2012 at 0:25

2

Solved

How to detect image in a document

How can I detect images in a document, say doc, xls, ppt or pdf? I came across Apache Tika and am trying its command-line option. http://tika.apache.org/1.2/gettingstarted.html But not quite s...

apache apache-tika

Brindisi asked 13/8, 2012 at 10:45

1

Solved

Spring & Tika integration: is my approach thread-safe?

I'm interested in Spring & Apache Tika integration. Is this approach thread-safe? <bean id="tika" class="org.apache.tika.Tika"/> Can I safely call the detect() method from different thread...

spring thread-safety apache-tika

Manhole asked 17/4, 2012 at 12:11

2

Solved

Elasticsearch Parse Exception error when attempting to index PDF

I'm just getting started with elasticsearch. Our requirement has us needing to index thousands of PDF files and I'm having a hard time getting just ONE of them to index successfully. Installed the...

pdf base64 elasticsearch apache-tika osx-server

Zahara asked 13/6, 2012 at 14:50

1

How to parse Arabic PDF with Tika

I've installed Tika with Solr, and it's working well for Arabic PDF. Is there any tutorial to make this happen? I've seen a similar question to this and the solution was to include ICU4J.jar, b...

solr arabic apache-tika

Scathing asked 9/4, 2012 at 17:14

4

Solved

Extract the text from URLs using Tika

Is it possible to extract text from URLs with Tika? Any links will be appreciated. Or is Tika usable only for PDF, Word and other media documents?

java apache-tika

Loidaloin asked 11/7, 2011 at 21:30

1

Solved

How can I detect Farsi web pages with Tika?

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit. LanguageIdentifier identifier = new LanguageIdentifier("فارسی"); String language = identifier.getLanguage(...

java apache apache-tika language-detection farsi

Dissimilar asked 28/1, 2012 at 11:30

3

How do I index documents in Solr?

I'm running Solr 1.4 on Ubuntu 10.04 (installed via apt-get solr-tomcat) and it seems to be working fine. I'm having some difficulty finding any coherent info on how to index documents though. I'm new...

solr full-text-search apache-tika solr-cell

Robbyrobbyn asked 10/5, 2010 at 10:48

1

Solved

Solr: data import handler and Solr Cell

Is it possible to index rich documents (PDF, Office, ...) with the data import handler using Solr Cell? I use Solr 3.2. Thanks.

solr apache-tika dataimporthandler solr-cell

Sherylsheryle asked 13/7, 2011 at 8:38

1

Tika Solr integration

I am trying to index using a curl-based request. The request is: curl "http://localhost:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&fmap.content=attr_content&commit=true" -...

solr full-text-search apache-tika solr-cell

Hertel asked 31/5, 2011 at 11:28
