apache-tika Questions
2
I need to use .NET Core and create a console app that uses .NET bindings for Apache Tika. Do you have any idea how to proceed?
I found a wrapper called 'TikaOnDotNet' but it only seems to...
Ajit asked 28/2, 2017 at 21:42
2
Solved
I am uploading files to an Amazon s3 bucket and have access to the InputStream and a String containing the MIME Type of the file but not the original file name. It's up to me to actually create the...
Tussah asked 30/11, 2012 at 17:44
2
I'm attempting to use Tika's AutoDetectParser to pull a file's content.
I originally thought this was a dependency issue but cannot fathom how that could still be true now that I'm including all of...
Panhandle asked 21/12, 2015 at 20:4
1
Done with :
java -jar tika-server-path --port xxxx
pip install tika (virtualenv)
parser-tika.py
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print parse...
Rosanne asked 2/10, 2016 at 14:22
7
I am trying to use the tika package to parse files. Tika is installed successfully, and I ran tika-server-1.18.jar from cmd with java -jar tika-server-1.18.jar
My code in Jupyter is:
import tik...
Cockatrice asked 25/7, 2018 at 8:28
2
I have a large repository of documents in PDF format. The documents come from different sources and have no single style. I use Tika to extract the text from the documents, and now I'd like to...
Discomposure asked 23/1, 2017 at 8:16
1
Solved
There are some existing posts out there that talk about "how to detect if a document is password protected".
This is probably the most comprehensive of these links for MS Office docs: Det...
Yuzik asked 18/9, 2019 at 17:21
4
For this link http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss this code doesn't work, but if I use another link, for example: https://www.go...
Vasta asked 3/9, 2014 at 10:35
5
Solved
Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunate...
Rask asked 4/6, 2012 at 21:43
2
To start a Tika server that can be accessed from hosts other than localhost, we know that the way to go is (say I have version 1.7 and want to run on port 9998):
java -jar tika-server-1.7-S...
Metabolic asked 2/9, 2014 at 21:59
5
I'm using Tika and I noticed that the jar file is downloaded each time and placed in the Temp folder:
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-se...
Abdel asked 12/6, 2019 at 10:20
4
Solved
The PDF in this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables like this:
I'd like to programmatically extract the data and the structure from these tables.
Thi...
Dunaj asked 24/3, 2014 at 21:40
3
I'm parsing large PDF and Word documents using Tika, but I get the following error message:
Your document contained more than 100000 characters, and so your requested limit has been reached. To rec...
Gaston asked 26/6, 2015 at 18:2
1
I am using the Apache Tika parser to parse PDF files into text. Some PDFs may contain scanned documents. Apache Tika uses Tesseract to recognize text in images. But there is no jar library with T...
Chromolithography asked 16/9, 2017 at 12:24
6
Solved
I am trying to create a simple Java program which reads and extracts the content from the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all the...
Vincentvincenta asked 27/3, 2013 at 18:54
0
I'm trying to set up Tika for text extraction using Python. I've installed Java runtime jre 1.8.0, installed tika with pip install tika==1.23, downloaded the tika server jar file from this link, and...
Pessa asked 2/3, 2021 at 18:36
2
Solved
I have many folders where each has a couple of pdf files (other file types like .xlsx or .doc are there as well). My goal is to extract the pdf's text for each folder and create a data frame where ...
Treharne asked 16/2, 2021 at 12:47
0
I am attempting to parse tens of millions of office documents with Tika: PDFs, Word documents, Excel files, XML, etc. A wide assortment of types.
Throughput is very important. I need to be able to parse these files in a...
Meganmeganthropus asked 22/11, 2020 at 5:27
3
I am trying to extract the meta tags of HTML files and index them into Solr with the Tika integration. I am not able to extract those meta tags with Tika or display them in Solr.
My HTML file...
Bergren asked 21/2, 2013 at 15:25
1
I am using Solr's ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces. The extracted content returned has "rect" inserted wh...
Unfurl asked 4/3, 2014 at 17:21
0
I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files).
Current parsing outcome:
Tika silently returns text, which is miss...
Monumental asked 6/8, 2020 at 4:17
4
This is my class for reading MIME types. I am trying to add a new MIME type (properties file) and read it.
This is my class file:
/*
* To change this license header, choose License Headers in Pr...
Pish asked 17/6, 2015 at 15:19
4
Solved
I'm using Apache Tika, and I have files (without extension) of a particular content type that need to be renamed to have an extension that reflects the content type.
Any idea if there is something I cou...
Incredible asked 4/4, 2011 at 16:48
1
Solved
The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy, e.g.:
$ curl -T test/Dokument01...
Blowhard asked 25/5, 2020 at 21:26
5
I'm trying to parse a few PDF files that contain engineering drawings to obtain text data in the files. I tried using Tika as a jar with Python and using it with the jnius package (using this tutor...
Greengrocer asked 12/10, 2015 at 5:39