It is possible to use an AutoDetectParser to extract images, without relying on PDFParser. This code works just as well for extracting images from docx, pptx, etc.
Here I have a parseDocument() and a setPdfConfig() function, which make use of an AutoDetectParser.
- I create an AutoDetectParser.
- Attach an EmbeddedDocumentExtractor onto a ParseContext.
- Attach the AutoDetectParser onto the same ParseContext.
- Attach a PDFParserConfig onto the same ParseContext.
- Then give that ParseContext to AutoDetectParser.parse().
The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.
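    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Tell the PDF parser to hand inline images to the EmbeddedDocumentExtractor.
    private static void setPdfConfig(ParseContext context) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(true);
        context.set(PDFParserConfig.class, pdfConfig);
    }

    // Parses the file at 'path', saves its embedded files to '<path>_/',
    // and returns the document content as XHTML.
    private static String parseDocument(String path) {
        String xhtmlContents = "";
        AutoDetectParser parser = new AutoDetectParser();
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        EmbeddedDocumentExtractor embeddedDocumentExtractor =
                new EmbeddedDocumentExtractor() {
            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                // Accept every embedded resource.
                return true;
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                    throws SAXException, IOException {
                // Write each embedded resource to '<path>_/<resource name>'.
                Path outputDir = new File(path + "_").toPath();
                Files.createDirectories(outputDir);
                Path outputPath = new File(outputDir.toString() + "/" + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
                Files.deleteIfExists(outputPath);
                Files.copy(stream, outputPath);
            }
        };

        context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
        context.set(AutoDetectParser.class, parser);
        setPdfConfig(context);

        try (InputStream stream = new FileInputStream(path)) {
            parser.parse(stream, handler, metadata, context);
            xhtmlContents = handler.toString();
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        return xhtmlContents;
    }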
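One thing to watch in parseEmbedded(): metadata.get() returns null when an embedded resource has no name, so every unnamed image would end up in a single file literally called "null", and a name containing ../ could land outside the output folder. Below is a minimal sketch of a more defensive way to build the target path, using only java.nio. The class EmbeddedOutputPaths, its two methods, and the "embedded-<index>" fallback name are my own hypothetical helpers, not part of Tika.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Tika): builds safe output paths
// for embedded resources.
public class EmbeddedOutputPaths {

    // Output folder: the source path with "_" appended, e.g. "report.pdf_".
    public static Path outputDirFor(String sourcePath) {
        return Paths.get(sourcePath + "_");
    }

    // Returns a usable file name for an embedded resource.
    // Falls back to "embedded-<index>" when the name is missing or is a
    // bare "." / "..", and keeps only the last path segment so a name such
    // as "../../x.png" cannot escape the output folder.
    public static String safeName(String resourceName, int index) {
        String fallback = "embedded-" + index;
        if (resourceName == null) {
            return fallback;
        }
        // Treat Windows and Unix separators alike.
        String normalized = resourceName.replace('\\', '/');
        String last = normalized.substring(normalized.lastIndexOf('/') + 1).trim();
        if (last.isEmpty() || last.equals(".") || last.equals("..")) {
            return fallback;
        }
        return last;
    }
}
```

Inside parseEmbedded() this could be used as outputDir.resolve(EmbeddedOutputPaths.safeName(metadata.get(Metadata.RESOURCE_NAME_KEY), counter++)), where counter is an int field you add to the anonymous class (also an assumption on my part).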
--extract flag to test the image extraction? – Peterec