Is Apache Tika able to extract foreign languages like Chinese, Japanese?
I have the following code:
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
InputStream stream = new ByteArrayInputStream(bytes);
OutputStream outputstream = new ByteArrayOutputStream();
ContentHandler textHandler = new BodyContentHandler(outputstream);
Metadata metadata = new Metadata();
// Set<String> langs = LanguageIdentifier.getSupportedLanguages();
// metadata.set(Metadata.CONTENT_LANGUAGE, lang);
// metadata.set(Metadata.FORMAT, hint);
ParseContext context = new ParseContext();
try {
parser.parse(stream, textHandler, metadata, context);
String extractedText = outputstream.toString();
return extractedText;
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
If the input is a doc file that contains Chinese characters, each Chinese characters will be extracted as "?".
Thanks a lot!