apache-tika Questions

6

Solved

Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs. I have seen t...
Farnese asked 14/7, 2011 at 13:57

2

Solved

I am setting up a Java project where I use PDFBox to get images out of PDFs. Since I am using tika-app for my other functions, I decided to go with the PDFBox bundled inside tika-app-1.20.jar. I have t...
Diggings asked 29/8, 2019 at 10:01

4

I am getting all these warnings from Tika when I try to use it: Feb 24, 2018 9:24:35 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: JBIG2ImageReade...
Besmirch asked 25/2, 2018 at 4:16

1

I'm using Apache Tika App on my Ubuntu 16.04 server as a command-line tool to extract the content of documents. The [Apache Tika website][1] says the following: Build artifacts The Tika build con...
Wismar asked 28/7, 2018 at 15:18

2

Solved

I tried converting .doc to HTML using WordToHtmlConverter and it worked perfectly. But when I tried to convert .docx to HTML, I got stuck. What I tried: I used the below code to conve...
Jason asked 9/7, 2014 at 11:51

1

I am testing the Apache Tika REST API via Python for parsing HTML files. Everything works except one thing: the content of <noscript> tags is also parsed as text, and I am having some CSS styling con...
Anhedral asked 22/2, 2019 at 15:00

4

Solved

I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I saw Apache Tika. I am able to extr...
Brandiebrandise asked 22/11, 2012 at 16:48

1

Solved

I'm currently using tika to extract the text from PDF files. I found a very fast method within the tika module, called unpack. This is my code: from tika import unpack text = unpa...
Oakman asked 2/11, 2018 at 16:07

2

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work. My use case is that I want some code that will extract the content and...
Crutchfield asked 11/9, 2014 at 8:58

2

Solved

I have a question about analyzing documents. With Apache Tika, it is possible to get the content and metadata of files of many different types. Is it also possible to get keywords of...
Sign asked 10/10, 2017 at 9:26

2

I am writing a topic modeling program that uses Apache Tika to extract the text content from other file types. It runs perfectly in Eclipse, but when I export it to a JAR file to use from command pr...
Ostracod asked 14/3, 2018 at 20:23

1

I am trying to add a custom mime type to Apache Tika. I have the following custom-mimetypes.xml document in org.apache.tika.mime : <?xml version="1.0" encoding="UTF-8"?> <mime-info> ...
Decongestant asked 22/2, 2013 at 3:49

2

Solved

When I use Apache Tika to determine the file type from the content, XML files are detected fine but JSON files are not. If the content is JSON, it returns "text/plain" instead of "application/json". Any help?...
Chancellor asked 17/10, 2013 at 6:31

1

Solved

try { File file = new File("Example.pdf"); String content = new Tika().parseToString(file); System.out.println("The Content: " + content); } catch (Exception e) { e.printStackTrace(); } I h...
Aperiodic asked 31/7, 2017 at 11:04

1

Solved

I have a large directory of PDF files (images). How can I efficiently extract the text from all the files inside the directory? So far I tried to: import multiprocessing import textract def ex...

1

My PDF contains scanned images and I want to extract text from it. What I tried: I tried AutoDetectParser but got no output. I followed the solution provided in Apache Tika extract scanned PDF...
Geniegenii asked 29/9, 2016 at 6:23

2

I am using Apache Tika to detect the MIME type of an input stream, and I was wondering if there's a ready-made method to detect that the file is an executable. There's a big list of executable file...
Seaboard asked 23/2, 2016 at 5:37

1

Solved

I have thousands of PDF documents that are 11-15 MB each. My program says that my document contains more than 100k characters. Error output: Exception in thread "main" org.apache.tika.sax.WriteOut...
Natale asked 21/2, 2016 at 22:17

1

I am aware that Oracle notes ZIP/GZIP file compressor/decompressor methods on their website. But I have a scenario where I need to scan and find out whether any nested ZIPs/RARs are involved. For e...
Conchoidal asked 11/2, 2016 at 10:34

0

I'm looking to parse an email .msg or .eml file using Tika. With the code below, I'm able to parse the email along with what is inside the attachment. However, I'd like to get the attachment ...
Vibraphone asked 4/11, 2015 at 15:47

2

Solved

I just want to know how I can extract the main text and plain text from HTML using Tika. Maybe one possible solution is to use BoilerpipeContentHandler, but do you have some sample/demo codes to s...
Kassel asked 14/5, 2014 at 11:14

1

Solved

I have an application on my Ubuntu 14.04.x machine. This application does text mining on PDF files. I suspect that it is using Apache Tika etc... The problem is that, during its reading process, I...
Glaze asked 10/9, 2015 at 18:24

1

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files which are just scanned pieces of paper, meaning each page is just an image. My goal is to extract the text of the ...
Metalepsis asked 2/9, 2015 at 13:13

1

Solved

I'm using the Java library Tika by Apache (tika-core ver. 1.10). Is there an org.apache.tika.detect.Detector for CSV files? The MIME type should be text/csv, but I cannot find anything like that. I...
Manhunt asked 21/8, 2015 at 9:34

2

I am looking for a C/C++ alternative to the Java-based Apache Tika framework. Specifically, I am searching for file metadata and structured text extraction all under one framework. After som...
Dividend asked 3/6, 2011 at 22:11
