Is Apache Tika able to extract foreign languages like Chinese, Japanese?

Asked 26/3, 2013 at 13:58 Answered 18/9, 2013 at 9:54

I have the following code:

    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    InputStream stream = new ByteArrayInputStream(bytes);
    OutputStream outputstream = new ByteArrayOutputStream();
    ContentHandler textHandler = new BodyContentHandler(outputstream);
    Metadata metadata = new Metadata();
    // Set<String> langs = LanguageIdentifier.getSupportedLanguages();
    // metadata.set(Metadata.CONTENT_LANGUAGE, lang);
    // metadata.set(Metadata.FORMAT, hint);
    ParseContext context = new ParseContext();
    try {
        parser.parse(stream, textHandler, metadata, context);
        String extractedText = outputstream.toString();
        return extractedText;
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    }

If the input is a doc file that contains Chinese characters, each Chinese character is extracted as "?".

Thanks a lot!

Cretic asked 26/3, 2013 at 13:58 Comment(1)

Tika should be able to handle them just fine. Are you sure you've got the encoding correct when you output / view the text? (Hint: it'll most likely need to be something like UTF-8, and you'll need to display it using a font that has glyphs for Chinese characters!) – Oakman 27/3, 2013 at 5:34

Apache Tika is able to extract Unicode text from its supported file formats. As long as the file format can store Unicode text (e.g. Chinese or Japanese characters), Apache Tika can extract it.

Tika also includes a number of unit tests for this, which verify that it works. One such test uses this sample Chinese email. If we use the command-line Tika app and grab the first few lines, we see it working:

    $ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
    Alfresco MSG format testing ( MSG 格式測試 )
        From
        Tests Chang@FT (張毓倫)
        To
        Tests Chang@FT (張毓倫)
        Recipients
        [email protected]

Or with this Japanese document:

    $ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
    ゾルゲの処刑記録、
    ゾルゲと尾崎、淡々と最期

You'll just need to ensure that any text output you generate gets stored in a suitable encoding (e.g. UTF-8), and that the font you use to display it supports those glyphs!
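To see why the encoding matters, here is a minimal sketch that uses only the JDK (no Tika involved). The class name `EncodingRoundTrip` and the method `roundTrip` are made up for illustration. It writes the same two Chinese characters into a `ByteArrayOutputStream` through an `OutputStreamWriter`, once with UTF-8 and once with US-ASCII, which stands in for any charset that cannot represent them:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not part of Tika): shows how the charset used to
// write text into a byte stream decides whether CJK characters survive.
public class EncodingRoundTrip {

    // Encodes text into an in-memory byte stream with the given charset,
    // then decodes those bytes again with the same charset.
    public static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source
        // file itself stays plain ASCII.
        String sample = "\u4E2D\u6587";

        // UTF-8 can represent the characters, so the text survives: prints true
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));

        // US-ASCII cannot, so the writer substitutes '?' for each one: prints ??
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII));
    }
}
```

In the question's code, the bytes are decoded with `outputstream.toString()`, which uses the platform default charset. Assuming that is where the "?" comes from, one option is to skip the byte stream entirely: create the handler with `new BodyContentHandler()` and read the result with `textHandler.toString()`. Another is to pick the charset explicitly on both the writing and the reading side.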

Oakman answered 18/9, 2013 at 9:54 Comment(0)

I have not seen it written anywhere that Apache Tika does not support foreign languages like Chinese and Japanese. However, when looking at the following Apache Tika source file, I could not find either of those languages.

http://svn.apache.org/repos/asf/tika/branches/1.4/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties

However, you can still try implementing it the same way as described in the five-minute user guide, and test it with your Chinese doc file:

https://tika.apache.org/1.4/parser_guide.html

Housley answered 18/9, 2013 at 6:37 Comment(1)

The code you're referencing is for language detection, not for text extraction, which is a different part of Tika – Oakman 18/9, 2013 at 9:10
