apache-tika Questions

2

I need to use .NET Core and create a console app that uses .NET bindings for Apache Tika. Do you have any idea how to proceed? I found a wrapper called 'TikaOnDotNet' but it only seems to...
Ajit asked 28/2, 2017 at 21:42

2

Solved

I am uploading files to an Amazon S3 bucket and have access to the InputStream and a String containing the MIME type of the file, but not the original file name. It's up to me to actually create the...
Tussah asked 30/11, 2012 at 17:44

2

I'm attempting to use Tika's AutoDetectParser to pull a file's content. I originally thought this was a dependency issue, but I cannot fathom how that could still be true now that I'm including all of...
Panhandle asked 21/12, 2015 at 20:4

1

Done with: java -jar tika-server-path --port xxxx; pip install tika (virtualenv); parser-tika.py: import tika; from tika import parser; parsed = parser.from_file('/path/to/file'); print parse...
Rosanne asked 2/10, 2016 at 14:22

7

I am trying to use the tika package to parse files. Tika is successfully installed, and I ran tika-server-1.18.jar from cmd with: java -jar tika-server-1.18.jar. My code in Jupyter is: import tik...
Cockatrice asked 25/7, 2018 at 8:28

2

I have a large repository of documents in PDF format. The documents come from different sources and do not share a single style. I use Tika to extract the text from the documents, and now I'd like to...
Discomposure asked 23/1, 2017 at 8:16

1

Solved

There are some existing posts out there that talk about "how to detect if a document is password protected". This is probably the most comprehensive of these links for MS Office docs: Det...
Yuzik asked 18/9, 2019 at 17:21

4

This code doesn't work for the link http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss but it does if I use another one, for example: https://www.go...
Vasta asked 3/9, 2014 at 10:35

5

Solved

Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunate...
Rask asked 4/6, 2012 at 21:43

2

To start a Tika server that can be accessed from hosts other than localhost, we know that the way to go is (say I have version 1.7 and want to run on port 9998): java -jar tika-server-1.7-S...
Metabolic asked 2/9, 2014 at 21:59

5

I'm using Tika and I noticed that the jar file is downloaded and placed in the Temp folder each time: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-se...
Abdel asked 12/6, 2019 at 10:20

4

Solved

The PDF at this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables. I'd like to programmatically extract the data and the structure from these tables. Thi...
Dunaj asked 24/3, 2014 at 21:40

3

I'm parsing large PDF and Word documents using Tika but I get the following error message: Your document contained more than 100000 characters, and so your requested limit has been reached. To rec...
Gaston asked 26/6, 2015 at 18:2

1

I am using the Apache Tika parser to parse PDF files into text. Some PDFs may contain scanned documents. Apache Tika uses Tesseract to recognize text in images. But there is no jar library with T...
Chromolithography asked 16/9, 2017 at 12:24

6

Solved

I am trying to create a simple Java program which reads and extracts the content from the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all the...
Vincentvincenta asked 27/3, 2013 at 18:54

0

I'm trying to set up Tika for text extraction using Python. I've installed Java runtime JRE 1.8.0, installed tika with pip install tika==1.23, downloaded the tika server jar file from this link, and...
Pessa asked 2/3, 2021 at 18:36

2

Solved

I have many folders where each has a couple of pdf files (other file types like .xlsx or .doc are there as well). My goal is to extract the pdf's text for each folder and create a data frame where ...
Treharne asked 16/2, 2021 at 12:47

0

I am attempting to parse tens of millions of office documents with Tika: PDFs, Word documents, Excel files, XML, etc. A wide assortment of types. Throughput is very important. I need to be able to parse these files in a...
Meganmeganthropus asked 22/11, 2020 at 5:27

3

I am trying to extract the meta tags of HTML files and index them into Solr via the Tika integration. I am unable to extract those meta tags with Tika or display them in Solr. My HTML file...
Bergren asked 21/2, 2013 at 15:25

1

I am using Solr's ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces. The extracted content returned has "rect" inserted wh...
Unfurl asked 4/3, 2014 at 17:21

0

I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files). Current parsing outcome: Tika silently returns text, which is miss...
Monumental asked 6/8, 2020 at 4:17

4

This is my class for reading MIME types. I am trying to add a new MIME type (properties file) and read it. This is my class file: /* * To change this license header, choose License Headers in Pr...
Pish asked 17/6, 2015 at 15:19

4

Solved

I'm using Apache Tika, and I have files (without extensions) of a particular content type that need to be renamed to have an extension that reflects the content type. Any idea if there is something I cou...
Incredible asked 4/4, 2011 at 16:48

1

Solved

The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy, e.g.: $ curl -T test/Dokument01...
Blowhard asked 25/5, 2020 at 21:26

5

I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in the files. I tried using Tika as a jar with Python, via the jnius package (using this tutor...
Greengrocer asked 12/10, 2015 at 5:39
