Determining whether a MIME type is binary or text-based

Is there a library which allows determining whether a given content type is binary or text-based?

Obviously text/* is always textual, but for things like application/json, image/svg+xml or even application/x-latex it's rather tricky without inspecting the actual data.

Cenis answered 7/10, 2010 at 6:47 Comment(2)
Why don't you tell us what you're trying to do with the data?Egomania
Sorry, I should indeed have provided more details. Essentially (while simplified, this encapsulates the basics), I am lazily loading data - if an item is text-based, the process is different from loading binary data (only content type is known beforehand).Cenis
R
2

There's a Python wrapper for libmagic, pymagic. That's the easiest way to accomplish what you want. Keep in mind that magic is only as good as its fingerprints: you can get false positives if something 'looks' like another file format, but in most cases pymagic will give you what you need.

One thing to watch out for is the 'simple solution' of checking whether any characters fall outside the printable ASCII range: you will likely encounter Unicode, which looks like binary to that check (and, at the byte level, is binary) even though it's just textual content.
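To illustrate that pitfall, here is a minimal sketch that only applies once you actually have some bytes in hand. Both function names are made up for this example, and the UTF-8 default is an assumption; it is a heuristic, not what libmagic does internally.

```python
def naive_ascii_check(sample):
    """The 'simple solution': every byte must be printable ASCII or common whitespace."""
    return all(32 <= b < 127 or b in (9, 10, 13) for b in sample)


def looks_like_text(sample, encoding="utf-8"):
    """Heuristic: treat bytes as text if they hold no NUL byte and decode cleanly."""
    if b"\x00" in sample:
        return False
    try:
        sample.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


sample = "naïve café".encode("utf-8")
print(naive_ascii_check(sample))  # False: the accented letters are non-ASCII bytes
print(looks_like_text(sample))    # True: the bytes are valid UTF-8
```

This has its own blind spots: UTF-16 text contains NUL bytes and would be rejected, and a sample cut off in the middle of a multi-byte character fails to decode.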

Runner answered 7/10, 2010 at 7:21 Comment(3)
He's asking for determining whether a MIME type is binary, not for determining the MIME type based on file data.Simulate
It's questionable whether you'd want to automatically trust the MIME type that the server provides, but if you do, you could compare against the IANA MIME type registry (iana.org/assignments/media-types/index.html). There's no clear line there saying 'MIME type XYZ is binary/text', though; in most cases you just get redirected to another RFC with the details buried inside. libmagic just reads a handful of bytes and can reasonably detect the content type. Plus, there's always the chance someone will use a random MIME type that they made up for their custom client.Runner
I guess this comment pretty much answers my question; what I want is not possible (see my comment above). Fair enough, I can work around that, it just won't be quite as elegant then...Cenis
S
5

I don't know of a definitive list of binary and non-binary MIME types, but for the Common MIME types I think the following does pretty well.

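def is_binary(mime_type, subtype):
    # Anything under text/* is textual.
    if mime_type == "text":
        return False
    # Every other top-level type except application/* is treated as binary.
    # Note: this misclassifies image/svg+xml, which is text.
    if mime_type != "application":
        return True
    # application/* is binary unless the subtype is a known textual format.
    return subtype not in {"json", "ld+json", "x-httpd-php", "x-sh", "x-csh", "xhtml+xml", "xml"}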
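A possible variation, sketched under a few assumptions: it takes the whole Content-Type value, ignores parameters such as charset, and treats any subtype ending in +json or +xml as text, which covers image/svg+xml. The name is_probably_text and both constants are made up for this example, and the subtype list is the same illustrative one as above, not a definitive registry.

```python
# Assumption: illustrative list of textual application/* subtypes, not exhaustive.
TEXTUAL_APPLICATION_SUBTYPES = {
    "json", "ld+json", "x-httpd-php", "x-sh", "x-csh", "xhtml+xml", "xml",
}
# Assumption: structured syntax suffixes whose underlying syntax is text.
TEXTUAL_SUFFIXES = ("+json", "+xml")


def is_probably_text(content_type):
    """Guess from a Content-Type value alone whether the body is text."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = content_type.split(";", 1)[0].strip().lower()
    main_type, _, subtype = essence.partition("/")
    if main_type == "text":
        return True
    if subtype.endswith(TEXTUAL_SUFFIXES):
        return True
    return main_type == "application" and subtype in TEXTUAL_APPLICATION_SUBTYPES


print(is_probably_text("image/svg+xml"))              # True
print(is_probably_text("text/html; charset=utf-8"))   # True
print(is_probably_text("application/octet-stream"))   # False
```

Anything not recognised falls back to binary, so a textual type that is missing from the list (application/x-latex, for instance) is still classified as binary until you add it.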
Subtle answered 11/2, 2021 at 22:56 Comment(2)
An elegant solution, thanks.Deepseated
Just got bit in the ass by this; image/svg+xml is textFuttock
E
2

Usually, programs that determine the MIME type will also tell you the character set. For instance, file(1) (and the corresponding libmagic) gives the following output:

> file --mime-encoding /bin/ls
/bin/ls: binary
> file --mime-encoding /etc/passwd
/etc/passwd: us-ascii
Egomania answered 7/10, 2010 at 6:51 Comment(1)
Thanks, but this requires access to the actual data, which is not available - see my comment amending the original post.Cenis