How to find the mime type of a file in python?
C

19

270

Let's say you want to save a bunch of files somewhere, for instance in BLOBs. Let's say you want to dish these files out via a web page and have the client automatically open the correct application/viewer.

Assumption: The browser figures out which application/viewer to use by the mime-type (content-type?) header in the HTTP response.

Based on that assumption, in addition to the bytes of the file, you also want to save the MIME type.

How would you find the MIME type of a file? I'm currently on a Mac, but this should also work on Windows.

Does the browser add this information when posting the file to the web page?

Is there a neat python library for finding this information? A WebService or (even better) a downloadable database?

Combative answered 4/9, 2008 at 12:7 Comment(1)
Hey there to all newcomers. Just a quick overview over this topic, because the answers get harder to read with each new one: The mimetypes python standard library module uses the file extension only. Any of the several libmagic integrations use the file contents. For details read on below ;-)Preglacial
S
309

The python-magic method suggested by toivotuo is outdated. Python-magic's current trunk is on GitHub and, based on the readme there, finding the MIME type is done like this.

# For MIME types
import magic
mime = magic.Magic(mime=True)
mime.from_file("testdata/test.pdf") # 'application/pdf'
Stettin answered 2/5, 2010 at 12:2 Comment(10)
thanks for the comment! please note, that "above" is a difficult concept in stackoverflow, since the ordering is grouped by votes and ordered randomly inside the groups. I am guessing you refer to @toivotuo's answer.Combative
Yeh, I didn't have enough "points" to create comments at the time of writing this reply. But I probably should have written it as a comment, so that @toivotuo could have edited his answer.Stettin
rpm -qf /usr/lib/python2.7/site-packages/magic.py -i URL : darwinsys.com/file Summary : Python bindings for the libmagic API rpm -qf /usr/bin/file -i Name : file URL : darwinsys.com/file python-magic from darwinsys.com/file and which comes with Linux Fedora works like @toivotuo's said. And seems more main stream.Linalinacre
Since the magic library is not a standard python lib, this is very clumsy :-( Isn't there some way how to use unix file command directly? Unfortunately s = os.system("file -b --mime-type /home/me/myfile.bz2") doesn't write the MIME into s, but only prints it to stdout :-(Altercate
Beware that the debian/ubuntu package called python-magic is different to the pip package of the same name. Both are import magic but have incompatible contents. See https://mcmap.net/q/11405/-how-to-determine-the-encoding-of-text for more.Phillip
The magic module comes from filemagic on pypiPassional
Check my answer in the context of no file extension or false file extension, Python 3.X and web application https://mcmap.net/q/102852/-how-to-find-the-mime-type-of-a-file-in-pythonNessus
As I commented on toivotuo’s answer, it is not outdated! You are talking about a different library. Can you please remove or replace that statement in your answer? It currently makes finding the best solution really difficult.Preglacial
js and css are just "text/plain" with this!Vulpine
As said by @ManojAcharya, this will not give useful results when applied to most web-related files. Basically everything will be text/plain, except for XML files, which will be either text/html or text/xml (!), depending on the library’s mood. When you need results for web files, you should use the mimetypes module (see the answers by dave-webb, oetzi and others below).Premaxilla
U
127

The mimetypes module in the standard library will determine/guess the MIME type from a file extension.

If users are uploading files the HTTP post will contain the MIME type of the file alongside the data. For example, Django makes this data available as an attribute of the UploadedFile object.
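
As a minimal sketch of the extension-based guess described above (the helper name guess_mime_from_name and the fallback value are illustrative assumptions, not part of the standard library):

```python
import mimetypes

def guess_mime_from_name(filename, default="application/octet-stream"):
    """Guess a MIME type from the file name alone; the content is never read."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # guess_type returns (None, None) for a missing or unknown extension,
    # so fall back to a generic binary type (an assumption of this sketch).
    return mime_type or default

print(guess_mime_from_name("report.pdf"))
print(guess_mime_from_name("blob_without_extension"))
```

Because only the name is inspected, a renamed file gets whatever type its new extension suggests.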

Ufa answered 4/9, 2008 at 12:12 Comment(10)
If the files are stored in BLOBs, as specified in the question, you may not know the file extension.Raddled
Also remember to sanitize the files when/if outputting them to other users: #1746243Beaudoin
File extensions are not a reliable way to determine mime type.Euonymus
Echoing some of the comments above, a better solution is in Simon's answer.Janeejaneen
in python 3.6 this works: mimetypes.guess_type(path_file_to_upload)[1] Buffer
While @cerin is right that file extensions are not reliable, I've just discovered that the accuracy of python-magic (as suggested in the top answer) to be even lower, as confirmed by github.com/s3tools/s3cmd/issues/198. So, mimetypes seems a better candidate for me.Kuopio
While @Euonymus is basically right, getting text/plain for CSS, JavaScript etc. and text/xml (as ASCII) for XML containing a lot of non-ASCII characters is completely useless and even potentially harmful. Hence, for now, mimetypes is the way to go.Premaxilla
@Buffer Thanks for the syntax tip. I would default to checking part [0] -- the type, not the encoding. The response for me usually looked like ("application/json", None).Syllepsis
As of Python 3.8 mimetypes.guess_type now accepts a Path-like object. So it's compatible with Python's pathlib module. issue: bugs.python.org/issue34926Dupre
I don't understand how "mimetypes is the way to go"? In that case it makes it absolutely trivial for a malicious user to bypass your protections. Now, if you're in complete control of the source and destination then maybe, but a blanket statement that mimetypes is the solution is harmful.Wolfish
T
60

This seems to be very easy (Python 2 syntax):

>>> from mimetypes import MimeTypes
>>> import urllib 
>>> mime = MimeTypes()
>>> url = urllib.pathname2url('Upload.xml')
>>> mime_type = mime.guess_type(url)
>>> print mime_type
('application/xml', None)

Please refer to the old post.

Update: in Python 3 it is more convenient:

import mimetypes
print(mimetypes.guess_type("sample.html"))
Tocharian answered 13/2, 2014 at 13:9 Comment(4)
I don't think the urllib is required in your example.Cloistral
for Python 3.X replace import urllib with from urllib import request. And then use "request" instead of urllibTelefilm
Works for python 2.7 alsoAlleen
@oetzi's solution uses this module, but is more simple.Andesine
A
56

A more reliable way than using the mimetypes library is to use the python-magic package.

import magic
m = magic.open(magic.MAGIC_MIME)
m.load()
m.file("/tmp/document.pdf")

This would be equivalent to using file(1).

On Django one could also make sure that the MIME type matches that of UploadedFile.content_type.

Aquanaut answered 25/1, 2010 at 16:39 Comment(2)
See Simon Zimmermann's post for an updated use of python-magicCombative
@DarenThomas: As mentioned in mammadori’s answer, this answer is not outdated and distinct from Simon Zimmermann’s solution. If you have the file utility installed, you can probably use this solution. It works for me with file-5.32. On gentoo you also have to have the python USE-flag enabled for the file package.Preglacial
P
39

13 years later...
Most of the answers on this page for python 3 were either outdated or incomplete.
To get the mime type of a file I use:

import mimetypes

mime_type, _ = mimetypes.guess_type("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
# guess_type always returns a 2-tuple, so test the type itself, not the tuple
if mime_type:
    print("Mime Type:", mime_type)
else:
    print("Cannot determine Mime Type")

# Mime Type: application/pdf



From Python docs:

mimetypes.guess_type(url, strict=True)

Guess the type of a file based on its filename, path or URL, given by url. URL can be a string or a path-like object.

The return value is a tuple (type, encoding) where type is None if the type can’t be guessed (missing or unknown suffix) or a string of the form 'type/subtype', usable for a MIME content-type header.

encoding is None for no encoding or the name of the program used to encode (e.g. compress or gzip). The encoding is suitable for use as a Content-Encoding header, not as a Content-Transfer-Encoding header. The mappings are table driven. Encoding suffixes are case sensitive; type suffixes are first tried case sensitively, then case insensitively.

The optional strict argument is a flag specifying whether the list of known MIME types is limited to only the official types registered with IANA. When strict is True (the default), only the IANA types are supported; when strict is False, some additional non-standard but commonly used MIME types are also recognized.

Changed in version 3.8: Added support for url being a path-like object.
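
To make the quoted documentation a bit more concrete, here is a small sketch of the (type, encoding) return value. The file names are made up for illustration, and the Path call assumes Python 3.8 or newer:

```python
import mimetypes
from pathlib import Path

# The second element reports a compression wrapper, not a character set.
tar_type, tar_encoding = mimetypes.guess_type("archive.tar.gz")
print(tar_type, tar_encoding)

# A missing or unknown suffix yields (None, None), so check the first
# element rather than the tuple (a two-element tuple is always truthy).
mime_type, encoding = mimetypes.guess_type("README")
print(mime_type, encoding)

# Path-like objects are accepted from Python 3.8 on.
pdf_type = mimetypes.guess_type(Path("docs") / "dummy.pdf")[0]
print(pdf_type)
```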

Prelusive answered 2/2, 2021 at 15:8 Comment(2)
If the file path has no extension, then mimetypes cannot return the right result.Acuminate
it is a great option, but not only does it require an extension, but also things like 'image/webp' are still not being detected.Octet
P
18

Python bindings to libmagic

All the different answers on this topic are very confusing, so I’m hoping to give a bit more clarity with this overview of the different bindings of libmagic. Previously mammadori gave a short answer listing the available option.

libmagic

When determining a file's MIME type, the tool of choice is simply called file and its back-end is called libmagic. (See the Project home page.) The project is developed in a private CVS repository, but there is a read-only git mirror on GitHub.

Now this tool, which you will need if you want to use any of the libmagic bindings with python, already comes with its own python bindings called file-magic. There is not much dedicated documentation for them, but you can always have a look at the man page of the c-library: man libmagic. The basic usage is described in the readme file:

import magic

detected = magic.detect_from_filename('magic.py')
# print() calls so that the example also runs on Python 3
print('Detected MIME type: {}'.format(detected.mime_type))
print('Detected encoding: {}'.format(detected.encoding))
print('Detected file type name: {}'.format(detected.name))

Apart from this, you can also use the library by creating a Magic object using magic.open(flags) as shown in the example file.

Both toivotuo and ewr2san use these file-magic bindings included in the file tool. They mistakenly assume that they are using the python-magic package. This seems to indicate that, if both file and python-magic are installed, the Python module magic refers to the former.

python-magic

This is the library that Simon Zimmermann talks about in his answer and which is also employed by Claude COULOMBE as well as Gringo Suave.

filemagic

Note: This project was last updated in 2013!

Due to being based on the same c-api, this library has some similarity with file-magic included in libmagic. It is only mentioned by mammadori and no other answer employs it.

Preglacial answered 22/6, 2018 at 10:25 Comment(0)
T
15

2017 Update

No need to go to github, it is on PyPi under a different name:

pip3 install --user python-magic
# or:
sudo apt install python3-magic  # Ubuntu distro package

The code can be simplified as well:

>>> import magic

>>> magic.from_file('/tmp/img_3304.jpg', mime=True)
'image/jpeg'
Torture answered 15/10, 2017 at 19:9 Comment(2)
can you do same for js or css file ?Derogative
Sure, why not??Torture
P
11

There are 3 different libraries that wrap libmagic.

2 of them are available on pypi (so pip install will work):

  • filemagic
  • python-magic

Another one, similar to python-magic, is available directly in the latest libmagic sources, and it is the one you probably have in your Linux distribution.

In Debian, the package python-magic is this one. It is used as toivotuo said, and it is not obsolete, contrary to what Simon Zimmermann said (IMHO).

It seems to me to be another take (by the original author of libmagic).

Too bad it is not available directly on PyPI.

Pious answered 6/9, 2012 at 10:22 Comment(1)
I added a repo for convenience: github.com/mammadori/magic-python that way you can: pip install -e git://github.com/mammadori/magic-python.git#egg=Magic_file_extensionsPious
A
10

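In Python 2.6 and later (including Python 3):

import subprocess

# PATH is the path of the file to inspect. Passing an argument list without
# shell=True avoids shell quoting problems (shlex.quote only exists in Python 3.3+).
mime = subprocess.Popen(["/usr/bin/file", "--mime", PATH],
    stdout=subprocess.PIPE).communicate()[0]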
Arielariela answered 2/11, 2009 at 15:48 Comment(3)
This is unnecessary, since the file command is basically just a wrapper around libmagic. You may as well just use the python binding (python-magic), as in Simon's answer.Raddled
That depends on the operating system. On Mac OS X, for example, you have "file" but not libmagic in the normal environment.Nevil
this looks unsafe to me, PATH should be escaped. idk what the Python equivalent is, but php devs would use Popen("/usr/bin/file --mime ".escapeshellarg(PATH)); - for example your code would fail on files containing newlines or quotes, probably also $dollarsign, but it would also protect you against hackers doing PATH='; rm -rfv / and such shell injectionSuppositious
H
8

python 3 ref: https://docs.python.org/3.2/library/mimetypes.html

mimetypes.guess_type(url, strict=True) Guess the type of a file based on its filename or URL, given by url. The return value is a tuple (type, encoding) where type is None if the type can’t be guessed (missing or unknown suffix) or a string of the form 'type/subtype', usable for a MIME content-type header.

encoding is None for no encoding or the name of the program used to encode (e.g. compress or gzip). The encoding is suitable for use as a Content-Encoding header, not as a Content-Transfer-Encoding header. The mappings are table driven. Encoding suffixes are case sensitive; type suffixes are first tried case sensitively, then case insensitively.

The optional strict argument is a flag specifying whether the list of known MIME types is limited to only the official types registered with IANA. When strict is True (the default), only the IANA types are supported; when strict is False, some additional non-standard but commonly used MIME types are also recognized.

import mimetypes
print(mimetypes.guess_type("sample.html"))
Hellenistic answered 15/9, 2019 at 5:5 Comment(0)
S
7

You didn't state what web server you were using, but Apache has a nice little module called Mime Magic which it uses to determine the type of a file when told to do so. It reads some of the file's content and tries to figure out what type it is based on the characters found. And as Dave Webb mentioned, the mimetypes module under Python will work, provided an extension is handy.

Alternatively, if you are sitting on a UNIX box you can use os.popen('file -i ' + fileName, mode='r') to grab the MIME type. Windows should have an equivalent command, but I'm unsure as to what it is.

Seditious answered 4/9, 2008 at 12:22 Comment(2)
Nowadays you can just do subprocess.check_output(['file', '-b', '--mime', filename])Pontifical
There is really no reason to resort to using an external tool when python-magic does the equivalent thing, all wrapped and cozy.Reedreedbird
S
7

@toivotuo 's method worked best and most reliably for me under python3. My goal was to identify gzipped files which do not have a reliable .gz extension. I installed python3-magic.

import magic

filename = "./datasets/test"

def file_mime_type(filename):
    m = magic.open(magic.MAGIC_MIME)
    m.load()
    return(m.file(filename))

print(file_mime_type(filename))

for a gzipped file it returns: application/gzip; charset=binary

for an unzipped txt file (iostat data): text/plain; charset=us-ascii

for a tar file: application/x-tar; charset=binary

for a bz2 file: application/x-bzip2; charset=binary

and last but not least for me a .zip file: application/zip; charset=binary
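
The strings above combine the MIME type with a charset parameter. If only the type is needed, splitting them could be sketched like this (split_mime_string is a hypothetical helper, not part of any magic binding):

```python
def split_mime_string(value):
    """Split a string such as 'application/gzip; charset=binary'."""
    mime_type, _, params = value.partition(";")
    charset = None
    for param in params.split(";"):
        key, _, val = param.strip().partition("=")
        if key.lower() == "charset":
            charset = val.strip() or None
    return mime_type.strip(), charset

print(split_mime_string("application/gzip; charset=binary"))
print(split_mime_string("text/plain"))
```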

Superimpose answered 3/2, 2015 at 19:9 Comment(0)
N
6

In Python 3.x, for a web application with a URL to a file that may have no extension or a fake extension, you should install python-magic using

pip3 install python-magic

For Mac OS X, you should also install libmagic using

brew install libmagic

Code snippet

import urllib
import magic
from urllib.request import urlopen

url = "http://...url to the file ..."
request = urllib.request.Request(url)
response = urlopen(request)
mime_type = magic.from_buffer(response.readline(), mime=True)  # mime=True returns the MIME type instead of a description
print(mime_type)

alternatively you could put a size into the read

import urllib
import magic
from urllib.request import urlopen

url = "http://...url to the file ..."
request = urllib.request.Request(url)
response = urlopen(request)
mime_type = magic.from_buffer(response.read(128), mime=True)  # mime=True returns the MIME type instead of a description
print(mime_type)
Nessus answered 6/9, 2016 at 19:55 Comment(3)
Will it load the whole file?Hidie
No, it's a stream, so normally just few bytes.Nessus
I've edited by response.readline() or response.read(128) Thank you!Nessus
N
5

I try the mimetypes library first. If that doesn't work, I use the python-magic library instead.

import mimetypes
def guess_type(filename, buffer=None):
    # first try the extension-based guess from the standard library
    mimetype, encoding = mimetypes.guess_type(filename)
    if mimetype is None:
        try:
            # fall back to content-based detection if python-magic is installed
            import magic
            if buffer:
                mimetype = magic.from_buffer(buffer, mime=True)
            else:
                mimetype = magic.from_file(filename, mime=True)
        except ImportError:
            pass
    return mimetype
Nihil answered 22/5, 2019 at 1:18 Comment(0)
I
2

The mimetypes module recognises a file type based only on the file extension. If you try to determine the type of a file without an extension, mimetypes will not work.
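
To illustrate the difference, here is a rough sketch of content-based detection using only the standard library. The signature table is a tiny illustrative subset chosen for this example, not what libmagic actually uses, and sniff_mime is a hypothetical helper:

```python
import mimetypes

# A few well-known leading "magic" bytes. Real detectors know far more
# formats; ZIP-based formats such as .docx would also match the ZIP entry.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x1f\x8b", "application/gzip"),
    (b"PK\x03\x04", "application/zip"),
]

def sniff_mime(data):
    """Return a MIME type for known leading bytes, else None."""
    for prefix, mime_type in SIGNATURES:
        if data.startswith(prefix):
            return mime_type
    return None

blob = b"%PDF-1.7\n..."  # stand-in for the bytes of a stored BLOB
print(mimetypes.guess_type("blob_4711")[0])  # no extension to go on
print(sniff_mime(blob))
```

For anything beyond a handful of formats, one of the libmagic bindings is probably the better choice.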

Influent answered 19/6, 2012 at 12:51 Comment(1)
I don't think that's true. The MIME type is about how to tell others about a data format, not about how to find out the data format yourself. If you use a tool that guesses the format only based on the extension and prints out MIME types then you can't use that tool if there are no file extensions. But other ways to guess the format are possible as well, e.g., by checking with a parser.Tabasco
R
2

I'm surprised that nobody has mentioned it, but Pygments is able to make an educated guess about the MIME type of, in particular, text documents.

Pygments is actually a Python syntax highlighting library, but it has a function that will make an educated guess about which of 500 supported document types your document is, e.g. C++ vs. C# vs. Python etc.

import inspect

def _test(text: str):
    from pygments.lexers import guess_lexer
    lexer = guess_lexer(text)
    mimetype = lexer.mimetypes[0] if lexer.mimetypes else None
    print(mimetype)

if __name__ == "__main__":
    # Set the text to the actual definition of _test(...) above
    text = inspect.getsource(_test)
    print('Text:')
    print(text)
    print()
    print('Result:')
    _test(text)

Output:

Text:
def _test(text: str):
    from pygments.lexers import guess_lexer
    lexer = guess_lexer(text)
    mimetype = lexer.mimetypes[0] if lexer.mimetypes else None
    print(mimetype)


Result:
text/x-python

Now, it's not perfect, but if you need to be able to tell which of 500 document formats are being used, this is pretty darn useful.

Radiochemical answered 1/7, 2020 at 8:54 Comment(0)
S
1

I've tried a lot of examples, but mutagen plays nicely with Django.

Example checking whether a file is an MP3:

from django.core.exceptions import ValidationError
from mutagen.mp3 import MP3, HeaderNotFoundError

try:
    audio = MP3(file)  # file: the uploaded file being validated
except HeaderNotFoundError:
    raise ValidationError('This file should be mp3')

The downside is that your ability to check file types is limited, but it's a great option if you want not only to check the file type but also to access additional information.

Sirree answered 19/8, 2017 at 10:33 Comment(0)
D
1

For byte array data you can use magic.from_buffer(_byte_array, mime=True)

Dapplegray answered 25/7, 2018 at 4:43 Comment(0)
D
0

I have had trouble getting any of the magic implementation modules to work under MSYS2 with Python 3. What I settled on was calling the file executable & falling back to the mimetypes module.

import mimetypes
import os
import subprocess

def getMimeType(filepath):
  mime_type = None
  # replace '<system_root>/usr/bin/file.exe' with path to 'file' executable for your system
  if os.path.isfile(filepath) and os.path.isfile("<system_root>/usr/bin/file.exe"):
    res = subprocess.run(["/usr/bin/file", "--mime-type", "--brief", filepath], stdout=subprocess.PIPE)
    if res.stdout:
      # strip the trailing newline that 'file' prints
      mime_type = res.stdout.decode("utf-8").strip()
  if not mime_type:
    # fallback to guessing by filename extension
    mime_type = mimetypes.guess_type(filepath)[0]
  return mime_type

Note that it would be good to have a function to search PATH for the 'file' executable instead of hardcoding it in.
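
Such a PATH lookup could be sketched with shutil.which from the standard library. The function name mime_from_file_command is made up for this example, and it assumes a 'file' build that understands the --mime-type and --brief options used above:

```python
import os
import shutil
import subprocess

def mime_from_file_command(filepath):
    """Ask the 'file' utility for a MIME type; return None if unavailable."""
    file_exe = shutil.which("file")  # searches PATH; None when not installed
    if file_exe is None or not os.path.isfile(filepath):
        return None
    res = subprocess.run(
        [file_exe, "--mime-type", "--brief", filepath],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    if res.returncode != 0:
        return None
    # 'file' ends its output with a newline, so strip it
    return res.stdout.decode("utf-8", errors="replace").strip() or None
```

When this returns None, the caller can still fall back to mimetypes.guess_type(filepath)[0].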

You could also use this as a fallback if a usable magic module isn't found.

__have_magic = False
try:
  import magic
  __have_magic = True
except ModuleNotFoundError:
  pass

...

def getMimeType(filepath):
  mime_type = None
  if __have_magic and os.path.isfile(filepath):
    mime_type = magic.from_file(filepath, mime=True)
  if not mime_type:
    # fallback to 'file' call
    ...
Dystrophy answered 11/4, 2023 at 10:52 Comment(0)
