How do I read Japanese characters from a PDF?

I'm using iText 7 in C# to parse a PDF file that contains Japanese characters, like so:

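    using System.Text;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    public static string ExtractTextFromPDF(string filePath)
    {
        var pdfReader = new PdfReader(filePath);
        var pdfDoc = new PdfDocument(pdfReader);
        var sb = new StringBuilder();
        // iText page numbers are 1-based.
        for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
        {
            // Use a fresh strategy per page so text does not accumulate across pages.
            var strategy = new SimpleTextExtractionStrategy();
            sb.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
        }
        pdfDoc.Close();
        pdfReader.Close();
        return sb.ToString();
    }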

But I run into this exception:

    iText.IO.IOException: 'The CMap iText.IO.Font.Cmap.UniJIS-UTF16-H was not found.'

I've searched for a way to add this CMap, but nothing I've found works for the Japanese characters. If another library is better suited to this, that would also be fine. Any help?

Thanks

Raila asked 23/6, 2020 at 7:55 Comments (2)
Have you included the com.itextpdf:font-asian dependency? - Sit
Thanks @Sit. I hadn't installed that dependency, nor found any reference to it while searching, but it certainly did the trick: the PDF is now parsed correctly with my original code. Please post your suggestion as an answer so I can mark it as the solution. Thanks! :) - Raila

The encoding CMaps, in particular those for CJK scripts, ship in a separate package. Without it, iText cannot resolve CMaps such as UniJIS-UTF16-H and throws the "was not found" exception.

For .NET, install itext7.font-asian via NuGet.

For Java, add com.itextpdf:font-asian via Maven.

The existence of this package is more visible for the Java version than for the .NET version.
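
For the .NET case, the install step might look like the following. This is a minimal sketch that assumes the project is managed with the `dotnet` CLI and that the command is run from the project directory; neither detail is stated in the question.

```shell
# Assumption: run from the directory that contains the project's .csproj file.
# Adds the package with the CJK CMap resources named in the answer above.
dotnet add package itext7.font-asian
```

Once the package is restored, the extraction code from the question should not need any changes, as the asker's comment reports. It is probably safest to keep the itext7.font-asian version in line with the itext7 version already referenced by the project.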

Sit answered 23/6, 2020 at 14:53