How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python (2.7) to parse PDF files?

I'm trying to parse a few PDF files that contain engineering drawings in order to extract the text in them. I tried using Tika as a JAR from Python through the jnius package (following this tutorial: http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html), but the code throws an error.

With the tika package, however, I was able to pass files in and parse them, but Python only extracts the metadata; when asked for the content, it returns None. It parses .txt files perfectly but fails to extract content from PDFs. Here's the code:

import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('/path/to/file')
print parsed["metadata"]
print parsed["content"]

Do I need additional packages or lines of code to be able to extract the data?

Greengrocer answered 12/10, 2015 at 5:39 Comment(5)
Is there actually any text in your PDFs? Computers are dumb. What looks like text to you, me, and everyone else may be just a couple of random lines to a computer. - Antagonist
The text in the PDFs has been scanned in and does not exist as actual characters. Essentially it is just the labels found on a typical engineering drawing (much like this one: 7-plus-ngm.org/bilder/piston.jpg). I need to be able to extract the label data, description tables, and notes included in the example image. - Greengrocer
Then you cannot use a general text extractor; you must use OCR (Optical Character Recognition) here. - Antagonist
NOTE: I tried passing PDFs that contain only text, even .doc files converted to .pdf, and the code still returns None as the output for content. So I wonder whether something is wrong with the package itself, or whether it requires other dependencies to work properly. - Greengrocer
Apache Tika supports OCR'ing text if you have the right tools installed. Did you try following the Tika OCR setup instructions? - Rudie
20

You need to download the Tika Server Jar and run it first. Check this link: http://wiki.apache.org/tika/TikaJAXRS

  1. Download the JAR.
  2. Store it somewhere and run it with java -jar tika-server-x.x.jar --port xxxx
  3. In your code you no longer need tika.initVM(). Replace it with tika.TikaClientOnly = True.
  4. Change parsed = parser.from_file('/path/to/file') to parsed = parser.from_file('/path/to/file', '/path/to/server'). The server path is the address printed in step 2 when the Tika server starts; just plug that in here.
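
Steps 3 and 4 above can be sketched roughly as follows. This is only a minimal sketch, assuming a Tika server is already running on the local machine; the port 9998 and the helper names server_url and parse_with_running_server are illustrative assumptions, not part of the original answer. Use whatever address your own server printed at startup.

```python
def server_url(host="localhost", port=9998):
    """Build the base URL of a running Tika server.

    The default host and port are assumptions for illustration only.
    """
    return "http://{0}:{1}".format(host, port)


def parse_with_running_server(path, endpoint):
    """Parse a file through an already-running Tika server."""
    import tika
    # Step 3: act as a client only instead of calling tika.initVM().
    tika.TikaClientOnly = True
    from tika import parser
    # Step 4: pass the server address as the second argument.
    return parser.from_file(path, endpoint)


# Example usage (requires the tika package and a running server):
#   parsed = parse_with_running_server("/path/to/file", server_url())
#   print(parsed["content"])
```

The tika import sits inside the function so that the flag is set before the parser module is loaded.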

Good luck!

Cheat answered 14/4, 2016 at 16:16 Comment(0)
6

Can you please share the file you are looking at? The easiest way to do this would perhaps be to attach it to a GitHub issue in my repository.

That said, if you are trying to use OCR and Tika, you need to run through the Tika OCR guide (http://wiki.apache.org/tika/TikaOCR) and get Tesseract installed. Once Tesseract is installed, then you need to double check whether or not you have an instance of tika-server running (e.g., ps aux | grep tika). If you do, kill it (tika-python runs the Tika REST server in the background as its main interface to Tika; having a fresh running version of it after Tesseract OCR is installed helps to eliminate any odd possibilities).
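
Before restarting anything, it can help to confirm that the Tesseract binary is actually visible from Python. The following is a small sketch under the assumption that the command-line tool is named tesseract and that Python 3.3+ is used (shutil.which does not exist in Python 2.7, where distutils.spawn.find_executable plays a similar role). The helper names are hypothetical.

```python
import shutil


def is_on_path(executable):
    """Return True if `executable` can be found on the current PATH."""
    return shutil.which(executable) is not None


def ocr_ready():
    """Check for the Tesseract command-line tool (assumed name: tesseract)."""
    return is_on_path("tesseract")


# Example usage:
#   if not ocr_ready():
#       print("Tesseract not found on PATH; OCR will not be available")
```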

Once Tesseract OCR is installed and no tika-server is running, start your Python 2.7 interpreter (or script) and do something like:

from tika import parser
parsed = parser.from_file('/path/to/file')
print parsed["content"] # should be the text returned from OCR
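
If OCR is not active, the content value can still come back as None while the metadata is filled in, which is the symptom described in the question. A minimal sketch for making that case explicit is shown here; the helper names extract_content and has_text are hypothetical, and the functions only assume the dict shape returned by parser.from_file (keys "metadata" and "content").

```python
def extract_content(parsed):
    """Return the extracted text, or an empty string if there is none."""
    content = parsed.get("content")
    if content is None:
        return ""
    return content.strip()


def has_text(parsed):
    """Return True if the parse result contains any non-blank text."""
    return bool(extract_content(parsed))


# Example usage with a result from parser.from_file:
#   if not has_text(parsed):
#       print("No text extracted; the PDF may need OCR")
```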

HTH! --Chris

Contrapositive answered 13/10, 2015 at 0:11 Comment(1)
This solution works. One more tip: pin tika to version 1.19 with the command pip install tika==1.19. With a newer version (1.22) I ran into the following error:
File "C:\Python36\lib\site-packages\tika\tika.py", line 546, in callServer
encodedData.close() # closes the file reading data
AttributeError: 'bytes' object has no attribute 'close'
- Gerta
2

I have never tried tika-python, but pyjnius works fine for me. Here is my code:

def parse_file(filename):
    """
    Import Tika classes and parse the input filename.
    Returns the extracted text, or None if parsing fails.
    """

    import os
    # The classpath must be set before jnius is imported
    os.environ['CLASSPATH'] = "/path/to/tika-app.jar"
    from jnius import autoclass
    from jnius import JavaException

    # Import the Java classes
    Tika = autoclass('org.apache.tika.Tika')
    Metadata = autoclass('org.apache.tika.metadata.Metadata')
    FileInputStream = autoclass('java.io.FileInputStream')

    tika = Tika()
    tika.setMaxStringLength(10 * 1024 * 1024)
    meta = Metadata()

    # Report the error and return None if parsing fails
    try:
        text = tika.parseToString(FileInputStream(filename), meta)
        return text
    except (JavaException, UnicodeDecodeError) as e:
        print("ERROR: %s" % (e))
    return None
Adventuress answered 26/1, 2016 at 9:37 Comment(0)
1

Install tika with the following pip command:

pip install tika

The following code works fine for extracting data:

from tika import parser

def extract_text(file):
    parsed = parser.from_file(file)
    # 'content' is None when Tika finds no extractable text
    parsed_text = parsed['content'] or ''
    # Lowercase the text to normalize it
    parsed_text = parsed_text.lower()
    return parsed_text

file_name_with_extension = input("Enter File Name:")
text = extract_text(file_name_with_extension)
print(text)

This prints only the content of the file. Supported file formats are listed here.

Fishbowl answered 14/5, 2020 at 5:9 Comment(0)
0

The solution given by Chris Mattmann is right. However, I would like to add a couple of points. Use the following snippet to write the text extracted from the PDF to a text file. Open the output file with UTF-8 encoding so that non-ASCII text (for example, Chinese or Japanese characters) is written correctly.

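#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io

import tika

tika.initVM()
from tika import parser

parsed_pdf = parser.from_file('file.pdf')

# io.open accepts encoding= on both Python 2.7 and Python 3
with io.open('file.txt', 'w', encoding='utf-8') as file:
    # content is None when no text could be extracted
    file.write(parsed_pdf["content"] or u'')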
Gerta answered 5/12, 2019 at 6:33 Comment(0)
