apache-tika - McMap

2

I need to use .NET Core and create a console app that uses .NET bindings for Apache Tika. Do you guys have any idea on how to proceed? I found a wrapper called 'TikaOnDotNet' but it only seems to...

.net .net-core apache-tika

Ajit asked 28/2, 2017 at 21:42

2

Solved

How to determine appropriate file extension from MIME Type in Java

I am uploading files to an Amazon s3 bucket and have access to the InputStream and a String containing the MIME Type of the file but not the original file name. It's up to me to actually create the...

java amazon-s3 apache-tika

Tussah asked 30/11, 2012 at 17:44

2

Tika AutoDetectParser returning empty string?

I'm attempting to use Tika's AutoDetectParser to pull a file's content. I originally thought this was a dependency issue but cannot fathom how that could still be true now that I'm including all of...

java ant apache-tika

Panhandle asked 21/12, 2015 at 20:4

1

ImportError: cannot import name parser with tika-python

Done with : java -jar tika-server-path --port xxxx pip install tika (virtualenv) parser-tika.py import tika from tika import parser parsed = parser.from_file('/path/to/file') print parse...

python apache-tika

Rosanne asked 2/10, 2016 at 14:22

7

Use tika with python, runtimeerror: unable to start tika server

I am trying to use the tika package to parse files. Tika is successfully installed, and tika-server-1.18.jar was run from cmd with: java -jar tika-server-1.18.jar. My code in Jupyter is: import tik...

python parsing apache-tika

Cockatrice asked 25/7, 2018 at 8:28

2

Paragraph Segmentation using Machine Learning

I have a large repository of documents in PDF format. The documents come from different sources, and have no one single style. I use Tika to extract the text from the documents, and now I'd like to...

python machine-learning nlp apache-tika text-segmentation

Discomposure asked 23/1, 2017 at 8:16

1

Solved

Detect if file is password protected without loading it into memory?

There are some existing posts out there that talk about "how to detect if a document is password protected". This is probably the most comprehensive of these links for MS Office docs: Det...

java apache-tika

Yuzik asked 18/9, 2019 at 17:21

4

java.lang.IllegalArgumentException: protocol = http host = null

For this link http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss this code doesn't work, but if I put in another one, for example: https://www.go...

java url apache-tika

Vasta asked 3/9, 2014 at 10:35

5

Solved

textual content without metadata from Tika via SolrCell

Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunate...

solr apache-tika solr-cell

Rask asked 4/6, 2012 at 21:43

2

Stopping a Tika server properly

In order to start a Tika server that can be accessed from hosts other than localhost, we know that the way to go is (say I have version 1.7 and want to run on port 9998) java -jar tika-server-1.7-S...

java apache-tika

Metabolic asked 2/9, 2014 at 21:59

5

python how to use tika with existing jar file without downloading again

I'm using Tika and I realized that each time the jar file is downloaded and placed in the Temp folder: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-se...

python apache-tika

Abdel asked 12/6, 2019 at 10:20

4

Solved

parse tables from a PDF document

The PDF in this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables like this: I'd like to programmatically extract the data and the structure from these tables. Thi...

python parsing pdf pdfbox apache-tika

Dunaj asked 24/3, 2014 at 21:40

3

How to read large files using Tika?

I'm parsing large PDF and Word documents using Tika but I get the following error message: Your document contained more than 100000 characters, and so your requested limit has been reached. To rec...

apache-tika

Gaston asked 26/6, 2015 at 18:2

1

Apache Tika OCR without installing Tesseract

I am using the Apache Tika parser to parse PDF files into text. Some PDFs could contain scanned documents. Apache Tika uses Tesseract to recognize text in images. But there is no jar library with T...

java ocr tesseract apache-tika

Chromolithography asked 16/9, 2017 at 12:24

6

Solved

Read Content from Files which are inside Zip file

I am trying to create a simple Java program which reads and extracts the content from the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all the...

java zip extract apache-tika

Vincentvincenta asked 27/3, 2013 at 18:54

0

Tika server returned status: 404

I'm trying to set up Tika for text extraction using Python. I've installed Java runtime jre 1.8.0, installed tika with pip install tika==1.23, downloaded the tika server jar file from this link, and...

java python apache-tika text-extraction tika-server

Pessa asked 2/3, 2021 at 18:36

2

Solved

How to extract text from pdfs in folders with python and save them in dataframe?

I have many folders where each has a couple of pdf files (other file types like .xlsx or .doc are there as well). My goal is to extract the pdf's text for each folder and create a data frame where ...

python dataframe pdf apache-tika pdf-conversion

Treharne asked 16/2, 2021 at 12:47

0

How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

I am attempting to parse tens of millions of office documents with Tika: PDFs, docs, Excel files, XML, etc. A wide assortment of types. Throughput is very important. I need to be able to parse these files in a...

java kubernetes apache-tika tika-server

Meganmeganthropus asked 22/11, 2020 at 5:27

3

How to extract metatags from HTML files and index them in SOLR and TIKA

I am trying to extract the meta tags of HTML files and index them into Solr with Tika integration. I am not able to extract those meta tags with Tika or display them in Solr. My HTML file...

solr apache-tika data-import solr4

Bergren asked 21/2, 2013 at 15:25

1

Solr ExtractingRequestHandler extracting "rect" in links

I am using the Solr ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces. The extracted content returned has "rect" inserted wh...

solr apache-tika solr-cell

Unfurl asked 4/3, 2014 at 17:21

0

Regarding No Unicode mapping error while parsing pdf

I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files). Current parsing outcome: Tika silently returns text, which is miss...

parsing unicode pdfbox apache-tika pdf-parsing

Monumental asked 6/8, 2020 at 4:17

4

How to add new mime type to apache tika

This is my class for reading mime types. I am trying to add a new mime type (properties file) and read it. This is my class file: /* * To change this license header, choose License Headers in Pr...

java apache-tika

Pish asked 17/6, 2015 at 15:19

4

Solved

How to get file extension from content type?

I'm using Apache Tika, and I have files (without extension) of a particular content type that need to be renamed to have an extension that reflects the content type. Any idea if there is something I cou...

java content-type apache-tika

Incredible asked 4/4, 2011 at 16:48

1

Solved

Apache Tika Server - Request Header Parameters?

The Apache Tika Server provides a Rest API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy. e.g: $ curl -T test/Dokument01...

apache-tika tika-server

Blowhard asked 25/5, 2020 at 21:26

5

How can I use Tika package(https://github.com/chrismattmann/tika-python) in python(2.7) to parse PDF files?

I'm trying to parse a few PDF files that contain engineering drawings to obtain text data in the files. I tried using TIKA as a jar with python and using it with the jnius package (using this tutor...

python parsing pdf apache-tika

Greengrocer asked 12/10, 2015 at 5:39
