Getting the MimeType subtype with Apache Tika
M

4

17

I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents such as odt, ppt, pptx, xlsx, etc.

If you look at mimetypes.xml, each mime-type element is composed of the iana.org mime type plus a "sub-class-of" element that names its parent type:

  <mime-type type="application/msword">
    <alias type="application/vnd.ms-word"/>
    ............................
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
  </mime-type>

How can I get the iana.org mime-type name instead of the parent type name?

When testing mime type detection, I do this:

MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();

Test Results :

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>

Is there any way to get the actual subtype from mimetypes.xml instead of x-tika-msoffice or application/zip?

Moreover, I never get application/x-tika-ooxml for xlsx, docx and pptx documents; I always get application/zip.

Medor answered 21/8, 2011 at 10:14 Comment(0)
M
3

The default byte pattern detection rules in tika-core can only detect the generic OLE2 or ZIP format shared by all MS Office document types. As far as I know, you want to use ContainerAwareDetector for this kind of detection, with the MimeTypes detector as its fallback. Try this:

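public MediaType getContentType(InputStream is, String fileName) {
    // Passing the file name lets the fallback detector also apply its glob (extension) rules.
    Metadata md = new Metadata();
    md.set(Metadata.RESOURCE_NAME_KEY, fileName);
    // Container-aware detection first, with the MimeTypes repository as the fallback detector.
    Detector detector = new ContainerAwareDetector(tikaConfig.getMimeRepository());

    try {
        return detector.detect(is, md);
    } catch (IOException ioe) {
        // The stream could not be read: report the generic binary type (handle as you see fit).
        return MediaType.OCTET_STREAM;
    }
}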

This way your tests should pass.

Medor answered 22/8, 2011 at 9:19 Comment(1)
ContainerAwareDetector has been deprecated for some time now in Tika. For anyone looking at this today, you should instead be using Tika's new-ish DefaultDetector coupled with all the Tika parsers on your classpath. - Baeyer
H
34

Originally, Tika only supported detection by Mime Magic or by file extension (glob), as this is all that most mime detection before Tika did.
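To make that limitation concrete, here is a minimal sketch of magic-only detection. It is not Tika's code: the class name MagicSniffer and its sniff method are made up for illustration, and only two signatures are checked. Every zip-based office file starts with the same ZIP signature, and every legacy Office file with the same OLE2 signature, so the leading bytes alone cannot tell a .docx from an .xlsx:

```java
import java.util.Arrays;

/** Illustration only (hypothetical class); this is not how Tika is implemented. */
final class MagicSniffer {
    // ZIP local file header signature: "PK" followed by 0x03 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound document signature used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns a generic type name based only on the leading bytes of the file. */
    static String sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) {
            return "application/zip";              // docx, xlsx, pptx and odt all end up here
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "application/x-tika-msoffice";  // doc, xls and ppt all end up here
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Calling MagicSniffer.sniff with the first few bytes of any .docx or .xlsx gives the same answer, application/zip, which is exactly the symptom in the question.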

Because of the problems with Mime Magic and globs when it comes to detecting container formats, it was decided to add some new detectors to Tika to handle these. The Container Aware Detectors took the whole file, opened and processed the container, and then worked out the exact file type based on the contents. Initially, you needed to call them explicitly, but then they were wrapped up in ContainerAwareDetector which you'll see in some of the answers.
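The idea of opening the container and deciding from its contents can be sketched with nothing but java.util.zip. This is only an illustration under simplifying assumptions, not what Tika's container detectors actually do: the class ZipContainerSniffer is hypothetical, it checks only the conventional main part names of OOXML files, and real files can deviate from those names, so a production detector has to look at more than this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Illustration only (hypothetical class): guess an OOXML type from zip entry names. */
final class ZipContainerSniffer {
    static String detect(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Conventional main part names; an assumption that holds for typical files.
                switch (entry.getName()) {
                    case "word/document.xml":
                        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                    case "xl/workbook.xml":
                        return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                    case "ppt/presentation.xml":
                        return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                    default:
                        break; // not a marker entry, keep scanning
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "application/zip"; // nothing recognised: stay with the generic container type
    }

    /** Helper for experiments: builds an in-memory zip with empty entries of the given names. */
    static byte[] zipWith(String... entryNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Note that this has to read through the zip structure rather than just the first few bytes, which is why container detection needs access to the whole file.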

Since then, Tika has added a service loader pattern, initially for Parsers. This allowed classes to be auto-loaded when present, with a general way to identify which ones were appropriate and use those. This support was then extended to cover Detectors too, at which point the old ContainerAwareDetector could be removed in favour of something cleaner.

If you're on Tika 1.2 or later, and you want accurate detection of all formats, including container formats, you want to do something like:

 TikaConfig config = TikaConfig.getDefaultConfig();
 Detector detector = config.getDetector();

 TikaInputStream stream = TikaInputStream.get(fileOrStream);

 Metadata metadata = new Metadata();
 metadata.add(Metadata.RESOURCE_NAME_KEY, filenameWithExtension);
 MediaType mediaType = detector.detect(stream, metadata);

If you run this with only the Core Tika jar (tika-core-1.2-....), then the only detector present will be the mime magic one, and you'll get the old style detection based on magic + glob only. However, if you run this with both the Core and Parser Tika jars (plus their dependencies), or from Tika App (which includes core + parsers + dependencies automatically), then the DefaultDetector will use all the various Container Detectors to process your file. If your file is zip based, then detection will include processing the zip structure to identify the file type based on what's in there. This will give you the high accuracy detection you're after, without needing to call lots of different parsers in turn. DefaultDetector will use all Detectors that are available.
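One way to picture how the results of several detectors can be combined: each detector reports a type, and a newly reported type replaces the current guess only if it is a sub-class of it. The following is a rough sketch of that idea, not Tika's actual implementation. The class TypeHierarchy, its methods and the hand-written parent table are assumptions made for illustration; only the msword to x-tika-msoffice relation is taken from the mimetypes.xml snippet in the question.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only (hypothetical class): pick the most specific of several detected types. */
final class TypeHierarchy {
    private static final String ROOT = "application/octet-stream";
    private static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // child -> parent, in the spirit of <sub-class-of>. Hand-written, assumed subset.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("application/zip", ROOT);
        PARENT.put("application/x-tika-ooxml", "application/zip");
        PARENT.put(DOCX, "application/x-tika-ooxml");
        PARENT.put("application/x-tika-msoffice", ROOT);
        PARENT.put("application/msword", "application/x-tika-msoffice");
    }

    /** True if candidate is a direct or transitive sub-class of ancestor. */
    static boolean isSpecializationOf(String candidate, String ancestor) {
        for (String t = PARENT.get(candidate); t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** Folds the types reported by several detectors, keeping a type only if it refines the current best. */
    static String mostSpecific(String... detectedTypes) {
        String best = ROOT;
        for (String detected : detectedTypes) {
            if (isSpecializationOf(detected, best)) {
                best = detected;
            }
        }
        return best;
    }
}
```

With this table, a magic-based guess of application/zip combined with a container-based guess of the wordprocessingml type resolves to the latter, regardless of the order in which the detectors ran. With only a magic-based detector available, the result stays at application/zip.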

Hallette answered 1/7, 2012 at 15:4 Comment(4)
How do I detect a .properties file with tika-app 1.8? It is detected as text/plain, but I want it detected as text/properties. How do I customize this? - Orbicular
@Orbicular You need to ask that as a new question and/or raise an enhancement request in the Tika issue tracker. - Hallette
What are the dependencies required along with the Parsers jar? Are they in a separate jar/s of their own? - Cibis
The key is to include the Tika parsers in the dependencies (along with core) and then you can simply use Tika.detect(tikaInputStream) and that will do the job. No need for the metadata, mediatype or extracting the detector. - Guanidine
A
5

For anyone else with a similar problem who is using a newer Tika version, this should do the trick:

  1. Use ZipContainerDetector, since ContainerAwareDetector may no longer be available.
  2. Pass a TikaInputStream to the detector's detect() method so that Tika can determine the correct mime type.

My example code looks like this:

// Lazily created detector chain, shared by all calls.
private static Detector detector;

public static String getMimeType(final Document p_document)
{
    try
    {
        // The document name lets the fallback detector also apply its glob (extension) rules.
        Metadata metadata = new Metadata();
        metadata.add(Metadata.RESOURCE_NAME_KEY, p_document.getDocName());

        Detector detector = getDefaultDetector();

        LogMF.debug(log, "Trying to detect mime type with detector {0}.", detector);
        TikaInputStream inputStream = TikaInputStream.get(p_document.getData(), metadata);

        return detector.detect(inputStream, metadata).toString();
    }
    catch (Throwable t)
    {
        // Log the cause as well, otherwise the reason for the failure is lost.
        log.error("Error while determining mime-type of " + p_document, t);
    }

    return null;
}

private static Detector getDefaultDetector()
{
    if (detector == null)
    {
        List<Detector> detectors = new ArrayList<>();

        // zip compressed container types
        detectors.add(new ZipContainerDetector());
        // Microsoft stuff
        detectors.add(new POIFSContainerDetector());
        // mime magic detection as fallback
        detectors.add(MimeTypes.getDefaultMimeTypes());

        detector = new CompositeDetector(detectors);
    }

    return detector;
}

Note that the Document class is part of my domain model, so you will need to substitute whatever equivalent you have at those lines.

I hope that someone can use this.

Antimonic answered 26/6, 2012 at 9:15 Comment(4)
You'd be much, much better off just using DefaultDetector, rather than trying to call individual detectors yourself. - Hallette
I could not detect the mime-type of a word 2010 document with the default detector. Using my approach I can. But I haven't tested it against other document types. - Rimmer
DefaultDetector should work for that (there are a load of unit tests that show that!). If it doesn't, make sure you have the Tika Parsers jar on your classpath, along with the dependencies. - Hallette
I hope no one uses code that catches Throwable and returns null. - Saxe
B
2

You can use a custom Tika mime types file:

MimeTypes mimes = MimeTypesFactory.create(Thread.currentThread()
    .getContextClassLoader().getResource("tika-custom-MimeTypes.xml"));
Metadata metadata = new Metadata();
metadata.add(Metadata.RESOURCE_NAME_KEY, file.getName());
TikaInputStream tis = TikaInputStream.get(file);
String mimetype = new DefaultDetector(mimes).detect(tis, metadata).toString();

Put "tika-custom-MimeTypes.xml", containing your changes, in WEB-INF/classes.

In my case:

<mime-type type="video/mp4">
    <magic priority="60">
      <match value="ftypmp41" type="string" offset="4"/>
      <match value="ftypmp42" type="string" offset="4"/>
      <!-- add -->
      <match value="ftyp" type="string" offset="4"/>
    </magic>
    <glob pattern="*.mp4"/>
    <glob pattern="*.mp4v"/>
    <glob pattern="*.mpg4"/>
    <!-- sub-class-of type="video/quicktime" /-->
</mime-type>
<mime-type type="video/quicktime">
    <magic priority="50">
      <match value="moov" type="string" offset="4"/>
      <match value="mdat" type="string" offset="4"/>
      <!--remove for videos of screencast -->
      <!--match value="ftyp" type="string" offset="4"/-->
    </magic>
    <glob pattern="*.qt"/>
    <glob pattern="*.mov"/>
</mime-type>
Blinkers answered 8/3, 2015 at 6:48 Comment(0)
