Handle ligatures in Apache Tika
Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks.

Is there any way (not necessarily with Tika) to extract PDF text while converting ligatures into their separate characters?

import java.io.File;
import org.apache.tika.Tika;
File file = new File("path/to/file.pdf");
String text = new Tika().parseToString(file);

Edit

My PDF file is UTF-8 encoded (that's what InputStreamReader.getEncoding() reports), and my platform encoding is also UTF-8. Even with -Dfile.encoding=UTF8, it does not work.

For instance, I'm supposed to get "différentes implémentations", but what I actually get is "di��erentes impl�ementations".

Overissue answered 12/3, 2014 at 10:30 Comment(7)
Is there a chance that those characters are not in your working charset? - Agave
It seems OK; however, a Tika changelog says: "Invalid characters are now replaced with the Unicode replacement character (U+FFFD)", i.e., question marks. I tried the same operation with Snowtide's PDFTextStream, and those ligatures are replaced with spaces instead. - Overissue
What are you doing with your text object after parsing it? If you output it anywhere, you need to ensure that the output is in the right encoding, and that whatever you display it with supports those code points! - Polyurethane
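To tell whether the replacement characters are already in the extracted String or only appear when it is displayed, one option is to dump the String's code points. This is a minimal sketch using only the JDK; the class and method names (CodePointDump, dump, hasReplacementChar) are hypothetical and not part of Tika.

```java
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tika: inspects a String code point by code point.
final class CodePointDump {

    // Renders every code point as U+XXXX, separated by spaces.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // True if the String itself contains U+FFFD, i.e. the damage happened
    // before any output or display step.
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by the extractor.
        String extracted = "di\uFFFD\uFFFDerentes";
        System.out.println(dump(extracted));
        System.out.println(hasReplacementChar(extracted));
    }
}
```

If U+FFFD shows up in the dump, the characters were lost during extraction and changing the output encoding will not bring them back.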
I convert my String into a JSONObject (to use it as the body of a POST request for Elasticsearch indexing). FYI, see the edit: the detected encoding is UTF-8 and my platform encoding is UTF-8. - Overissue
By the way, I have the same issue with the combining sequence U+0065 + U+0301, which should give "é". I don't know if it helps, but this PDF file was originally written in LaTeX and encoded with MiKTeX-xdvipdfmx (0.7.8). - Overissue
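If the extractor does return the ligature code points (for example U+FB00 for "ff") rather than U+FFFD, Unicode NFKC normalization with the JDK's java.text.Normalizer expands them into separate letters and also composes e + U+0301 into a single é. The sketch below assumes that situation; the class name LigatureNormalizer and the method name expand are hypothetical. It cannot recover text that has already been replaced by U+FFFD.

```java
import java.text.Normalizer;

// Hypothetical helper: post-processes extracted text with Unicode normalization.
final class LigatureNormalizer {

    // NFKC applies compatibility decomposition (U+FB01 becomes "fi", U+FB00 becomes "ff")
    // and then canonical composition (e + U+0301 becomes U+00E9).
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for extractor output that still contains a ligature
        // and combining accents.
        String raw = "di\uFB00e\u0301rentes imple\u0301mentations";
        System.out.println(expand(raw));
    }
}
```

NFKC also folds other compatibility characters (full-width forms, superscript digits and so on), so it may be worth checking that this is acceptable for the indexed text before applying it to everything.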
I decided to use the node-tika npm package instead. It works. - Overissue
It's now 10 years later. Ligatures in text extraction should work if it works with PDFBox, and PDFBox is trying to be as good as Adobe Reader. - Chaing
