Handle ligatures in Apache Tika

Asked 12/3, 2014 at 10:30 Answered 12/3, 2014 at 10:30

java pdf character-encoding apache-tika ligature

Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks.

Any ideas (not limited to Tika) for extracting PDF text while converting ligatures into separate characters?

import java.io.File;
import org.apache.tika.Tika;
File file = new File("path/to/file.pdf");
String text = new Tika().parseToString(file); // throws IOException, TikaException

Edit

My PDF file is UTF-8 encoded (that's what InputStreamReader.getEncoding() reports), and my platform encoding is also UTF-8. Even with -Dfile.encoding=UTF8, it does not work.

For instance, I expect "différentes implémentations", but what I actually get is "di��erentes impl�ementations".
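
One option that does not depend on Tika is to post-process the extracted string with Unicode normalization. The sketch below uses the JDK's java.text.Normalizer in NFKC form, which expands compatibility characters such as the ligatures U+FB00 (ff) and U+FB01 (fi) into plain letters and also composes a base letter followed by a combining accent into a single character. The class name LigatureNormalizer and the sample string are hypothetical. It assumes the extractor actually returns the ligature code points; if the text already contains U+FFFD, the original characters are lost and normalization cannot recover them.

```java
import java.text.Normalizer;

public class LigatureNormalizer {

    // Applies Unicode NFKC normalization: compatibility characters such as
    // U+FB01 (fi ligature) are expanded into separate letters, and pairs like
    // U+0065 U+0301 (e + combining acute) are composed into one code point.
    public static String normalize(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for text returned by the extractor:
        // "di" + ff ligature + "e" + combining acute + "rentes ..."
        String sample = "di\uFB00e\u0301rentes imple\u0301mentations";
        System.out.println(normalize(sample));
    }
}
```

The same call could be applied to the result of parseToString before the text is indexed.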

Overissue answered 12/3, 2014 at 10:30 Comment(7)

Is there a chance that those characters are not in your working charset? – Agave 12/3, 2014 at 10:37

It seems OK; however, a Tika changelog says: "Invalid characters are now replaced with the Unicode replacement character (U+FFFD)", i.e. question marks. I tried the same operation with Snowtide's PDFTextStream, and there the ligatures are replaced with spaces instead. – Overissue 12/3, 2014 at 10:55

What are you doing with your text object after parsing it? If you output it anywhere, you need to ensure that that output is in the right encoding, and whatever you display it with supports those codepoints! – Polyurethane 12/3, 2014 at 19:23
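
A quick way to check this is to look at the code points in the string itself rather than at how a console or browser renders it. The sketch below is only a diagnostic aid under that assumption: the class name CodePointDump and its methods are hypothetical helpers, not part of Tika. It lists the code points, counts any U+FFFD already present, and encodes with an explicit UTF-8 charset instead of the platform default.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDump {

    // Lists every code point as U+XXXX so the content of the string can be
    // inspected independently of the console or font used to display it.
    public static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Counts U+FFFD replacement characters already present in the string.
    public static long countReplacementChars(String text) {
        return text.codePoints().filter(cp -> cp == 0xFFFD).count();
    }

    // Encodes explicitly as UTF-8 rather than relying on the platform default.
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a fi ligature followed by a replacement character.
        String sample = "\uFB01\uFFFD";
        System.out.println(dump(sample));
        System.out.println(countReplacementChars(sample));
    }
}
```

If the count is non-zero right after extraction, the characters were lost before any output step, so the output encoding is not the cause.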

I convert my String into a JSONObject (in order to use it as a POST request body for Elasticsearch indexing). FYI, see my edit: the detected encoding is UTF-8 and my platform encoding is UTF-8. – Overissue 14/3, 2014 at 13:26

By the way, I have the same issue with the combining sequence U+0065 U+0301, which renders as "é". I don't know if it helps, but this PDF file was originally written in LaTeX and produced with MiKTeX-xdvipdfmx (0.7.8). – Overissue 14/3, 2014 at 13:36

I decided to use node-tika npm package instead. It works. – Overissue 27/3, 2014 at 15:13

It's now 10 years later. Ligature extraction should work in Tika if it works with PDFBox, which Tika uses for PDF parsing, and PDFBox tries to be as good as Adobe Reader. – Chaing 14/7 at 8:43
