DOCX File type in PHP finfo_file is application/zip

Hello, I'm trying to validate the type of an uploaded file with the finfo_file function.

But when a .docx file is sent, the file type is:

application/zip

instead of:

application/vnd.openxmlformats-officedocument.wordprocessingml.document

How can I change this behavior?

Turbinal answered 6/7, 2011 at 10:49 Comment(6)
Actually, the new \w{3}x formats are zipped XML files. You can change the extension to .zip and extract them. I know it doesn't help, but it's nice to know :D - Noleta
Extracting the file for tests is really not a solution. - Turbinal
The type of the file is zip. If you want to know the type/format of the content, there is no way around looking into it. - Fanion
I agree it's acceptable, but only in small apps. As a temporary measure, I check that the type is 'application/zip' from finfo_file and 'application/vnd.openxmlformat...' from $_FILES["file"]["type"]. - Turbinal
For what it's worth, I've got the same code returning application/vnd.openxmlformats-officedocument.wordprocessingml.document and application/zip for the same file on different servers, Debian and CentOS respectively. This makes Laravel's validation for docx fail on the latter and work fine on the former. So be careful, and test in the environment you deploy your code to. - Faitour
I am also facing the same issue. The strange thing is that it is detected as application/zip only when uploaded from some systems. - Sankhya
F
14

As far as I know, the vendor-specific file types (vnd.) are not standardized (by any RFC) and are therefore not covered by finfo_file(). A .docx file is a zipped XML format, which is why finfo_file() returns application/zip (which is completely right). You could unzip the file and test the MIME type of the result, but that will give you XML (which is completely correct too) plus the other files used by the document. To distinguish between different XML formats, finfo_file() would have to analyze the content and know what each format looks like, which goes too far.

Fanion answered 6/7, 2011 at 10:56 Comment(4)
As far as I know, unless you extract the contents and examine them, there is nothing to distinguish any zip file (jar, docx, odf, zip, etc.) from any other. - Cornerstone
Maybe there is a way to put them into php.ini somehow? - Turbinal
Even if PHP knows about the MIME type: finfo_file() is designed to get the type of the file, not of its content. It's also not that easy to distinguish between such complex structures unambiguously. The document itself is just application/xml, so you need to look into it and analyze it too. - Fanion
@Cornerstone from my comment on the question: it does distinguish the type correctly in some circumstances. - Faitour
P
9

This works on Debian. Add this to /etc/magic:

#------------------------------------------------------------------------------
# $File: msooxml,v 1.1 2011/01/25 18:36:19 christos Exp $
# msooxml:  file(1) magic for Microsoft Office XML
# From: Ralf Brown <[email protected]>

# .docx, .pptx, and .xlsx are XML plus other files inside a ZIP
#   archive.  The first member file is normally "[Content_Types].xml".
# Since MSOOXML doesn't have anything like the uncompressed "mimetype"
#   file of ePub or OpenDocument, we'll have to scan for a filename
#   which can distinguish between the three types

# start by checking for ZIP local file header signature
0               string          PK\003\004
# make sure the first file is correct
>0x1E           string          [Content_Types].xml
# skip to the second local file header
#   since some documents include a 520-byte extra field following the file
#   header,  we need to scan for the next header
>>(18.l+49)     search/2000     PK\003\004
# now skip to the *third* local file header; again, we need to scan due to a
#   520-byte extra field following the file header
>>>&26          search/1000     PK\003\004
# and check the subdirectory name to determine which type of OOXML
#   file we have
>>>>&26         string          word/           Microsoft Word 2007+
!:mime application/msword
>>>>&26         string          ppt/            Microsoft PowerPoint 2007+
!:mime application/vnd.ms-powerpoint
>>>>&26         string          xl/             Microsoft Excel 2007+
!:mime application/vnd.ms-excel
>>>>&26         default         x               Microsoft OOXML
!:strength +10

Then tell PHP to use /etc/magic as its database:

$finfo = finfo_open(FILEINFO_MIME,"/etc/magic");
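
To make the first two rules of the magic file above more concrete, here is a minimal PHP sketch that performs the same byte-level check by hand: the ZIP local file header signature at offset 0, and the name of the first archive member at offset 0x1E. The function name startsWithContentTypes is hypothetical and not part of PHP or libmagic; the sketch only covers those two rules, not the later scan for the word/, ppt/ or xl/ directory.

```php
<?php
// Hypothetical helper: mirrors the first two rules of the magic file above.
// It does NOT implement the later rules that scan for word/, ppt/ or xl/.
function startsWithContentTypes(string $path): bool
{
    // 30-byte local file header + 19-byte file name = 49 bytes.
    $head = file_get_contents($path, false, null, 0, 49);
    if ($head === false || strlen($head) < 49) {
        return false;
    }

    // Rule 1: ZIP local file header signature "PK\003\004" at offset 0.
    $isZip = substr($head, 0, 4) === "PK\x03\x04";

    // Rule 2: the first member's name, stored at offset 0x1E (30).
    $firstIsContentTypes = substr($head, 0x1E, 19) === '[Content_Types].xml';

    return $isZip && $firstIsContentTypes;
}
```

If this returns false for a file that really is a .docx, the member order inside the archive is a likely reason the magic rules do not match it.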
Pegpega answered 19/7, 2012 at 15:31 Comment(2)
Thanks, I will surely test this! Do you think it will work with PHP open_basedir? - Turbinal
This works great for me when I test it with a .docx file that I've uploaded. Testing it on my local file system didn't work. - Mev
A
5

This is because a DOCX is a ZIP file:

An Office Open XML file is a ZIP-compatible OPC package containing XML documents and other resources.

Like OpenOffice (OpenDocument) files, the documents are ZIPs containing various resources in a structured and well-defined manner. So when you try to identify the file content, you first see that it is a ZIP file. You would then need to look inside the ZIP to decide whether it's a DOCX or OpenOffice file.
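
As a rough illustration of "looking inside the ZIP", the following sketch lists the archive entries with PHP's ZipArchive class and guesses the OOXML flavour from the conventional name of the main part. The function name guessOoxmlMimeType and the marker table are assumptions made for this example; the part names are the ones Office normally writes, but the format does not strictly require them, and macro-enabled files such as .docm would be reported as their non-macro counterparts.

```php
<?php
// Hypothetical helper (not part of PHP or fileinfo): guesses the OOXML
// flavour of a file from the entries of its ZIP container.
function guessOoxmlMimeType(string $path): ?string
{
    // Conventional main part of each format mapped to its MIME type.
    $markers = [
        'word/document.xml'    => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        'xl/workbook.xml'      => 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        'ppt/presentation.xml' => 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
    ];

    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }

    $result = null;
    // OOXML packages carry [Content_Types].xml; plain ZIPs usually do not.
    if ($zip->locateName('[Content_Types].xml') !== false) {
        foreach ($markers as $entry => $mime) {
            if ($zip->locateName($entry) !== false) {
                $result = $mime;
                break;
            }
        }
    }

    $zip->close();
    return $result;
}
```

A null result means the file is either not a ZIP or not recognisably one of the three formats, so the caller can fall back to whatever finfo_file() reported.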

As an alternative, you could have a look at the file extension: if you identify the file to be a ZIP and the extension happens to be .docx, then you can assume it to be an OOXML file. (The older .doc format is a binary format, not a ZIP, so this does not apply to it.)
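
The extension-based fallback could look roughly like the sketch below. The function name isDocxCandidate is hypothetical. It accepts a file when the extension is .docx and the detected type is either the full OOXML type or plain application/zip, which covers servers that report either value. Keep in mind that the original file name is supplied by the client, so this is a weak check.

```php
<?php
// Hypothetical helper: combines the detected MIME type with the extension
// of the original (client-supplied, therefore untrusted) file name.
function isDocxCandidate(string $detectedMime, string $originalName): bool
{
    $docxMime = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';

    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    if ($extension !== 'docx') {
        return false;
    }

    // Some libmagic databases report the container type only.
    return $detectedMime === $docxMime || $detectedMime === 'application/zip';
}

// Possible usage with an upload (field name "file" is an assumption):
// $finfo = finfo_open(FILEINFO_MIME_TYPE);
// $mime  = finfo_file($finfo, $_FILES['file']['tmp_name']);
// $ok    = isDocxCandidate($mime, $_FILES['file']['name']);
```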

Appetence answered 6/7, 2011 at 10:57 Comment(0)
C
1

See my answer in this thread:

Overview

PHP uses libmagic. When Magic detects the MIME type as "application/zip" instead of "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", this is because the files added to the ZIP archive need to be in a certain order.

This causes a problem when uploading files to services that enforce matching file extension and MIME type. For example, Mediawiki-based wikis (written using PHP) are blocking certain XLSX files from being uploaded because they are detected as ZIP files.

What you need to do is fix your XLSX by reordering the files written to the ZIP archive so that Magic can detect the MIME type properly.

...

The post continues to analyze the file and develop a solution by rewriting the file.

Here is the file list for a DOCX file created using Word.

$ unzip -l Word.docx
Archive:  Word.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1364  1980-01-01 00:00   [Content_Types].xml
      734  1980-01-01 00:00   _rels/.rels
      817  1980-01-01 00:00   word/_rels/document.xml.rels
     1823  1980-01-01 00:00   word/document.xml
     6799  1980-01-01 00:00   word/theme/theme1.xml
     2068  1980-01-01 00:00   docProps/thumbnail.emf
     2652  1980-01-01 00:00   word/settings.xml
     1954  1980-01-01 00:00   word/fontTable.xml
      576  1980-01-01 00:00   word/webSettings.xml
      735  1980-01-01 00:00   docProps/core.xml
    28979  1980-01-01 00:00   word/styles.xml
      709  1980-01-01 00:00   docProps/app.xml
---------                     -------
    49210                     12 files

You may have to imitate that file order or try writing the "[Content_Types].xml", "word/document.xml", and "word/styles.xml" files first before other files.
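
A minimal sketch of that reordering, assuming PHP's ZipArchive extension is available: it copies every member into a new archive, writing "[Content_Types].xml", "word/document.xml" and "word/styles.xml" first. The function name rewriteWithContentTypesFirst is hypothetical, each member is loaded into memory, and directory entries are skipped. Whether libmagic then reports the expected type still depends on the magic database in use.

```php
<?php
// Hypothetical helper: rewrites a DOCX so that the members libmagic tends
// to look for come first. Returns false if either archive cannot be opened.
function rewriteWithContentTypesFirst(string $source, string $target): bool
{
    $in = new ZipArchive();
    if ($in->open($source) !== true) {
        return false;
    }

    // Collect member names in their current order.
    $names = [];
    for ($i = 0; $i < $in->numFiles; $i++) {
        $names[] = $in->getNameIndex($i);
    }

    // Preferred members first (if present), then everything else unchanged.
    $preferred = ['[Content_Types].xml', 'word/document.xml', 'word/styles.xml'];
    $ordered = array_merge(
        array_intersect($preferred, $names),
        array_diff($names, $preferred)
    );

    $out = new ZipArchive();
    if ($out->open($target, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        $in->close();
        return false;
    }

    foreach ($ordered as $name) {
        if (substr($name, -1) === '/') {
            continue; // skip directory entries
        }
        $out->addFromString($name, $in->getFromName($name));
    }

    $ok = $out->close(); // entries are written in the order they were added
    $in->close();
    return $ok;
}
```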

Crossley answered 21/12, 2019 at 17:44 Comment(1)
Is having [Content_Types].xml as the first archive member a requirement by OpenXML, or is it only a shortcoming of libmagic? - Gompers
K
0

We had the same problem with PHP 5.3. It works fine under PHP 7.2. I have application/vnd.openxmlformats-officedocument.wordprocessingml.document for my docx file.

To make sure that you have a docx file under PHP 5.3, you can check the MIME type listed in the [Content_Types].xml file inside the archive (the docx).
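
One way this check might be sketched, assuming the ZipArchive extension is available: read [Content_Types].xml from the archive and look for the content type that Word declares for the main document part. The function name isDocxByContentTypes is hypothetical, and the plain substring search is a simplification; a stricter version would parse the XML.

```php
<?php
// Hypothetical helper: returns true when [Content_Types].xml declares the
// WordprocessingML main document content type.
function isDocxByContentTypes(string $path): bool
{
    // Content type of the main part of a .docx, as listed in [Content_Types].xml.
    $mainType = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml';

    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false; // not a readable ZIP archive
    }

    $xml = $zip->getFromName('[Content_Types].xml');
    $zip->close();

    // Simplification: substring search instead of real XML parsing.
    return $xml !== false && strpos($xml, $mainType) !== false;
}
```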

Knott answered 30/10, 2018 at 14:52 Comment(0)
C
0

PHP 7.3 and later can detect it properly using finfo_file(). The fileinfo PHP extension uses a bundled libmagic, and it seems the library already detects .docx files correctly in all currently supported PHP versions (7.4, 8.0, 8.1).

Running

finfo_file(finfo_open(FILEINFO_MIME_TYPE), 'test.docx');

now returns

application/vnd.openxmlformats-officedocument.wordprocessingml.document

You can see the result of the same function call on older PHP versions here https://3v4l.org/uSqkR - notice the change on 7.3. The example is using finfo_buffer() and Base64-encoded file so that I can have the file "inlined" in the PHP code.

If the correct type is not detected, it's possible you may be using (even unknowingly) a custom "magic" database which does not support the type. You can specify the database as an extra parameter to finfo_open(), for example

finfo_open(FILEINFO_MIME_TYPE, '/etc/magic.mime');

If the code you're using is doing that, and you're using PHP 7.3 or newer, remove the /etc/magic.mime parameter.

The database can also be specified using the MAGIC environment variable. Check whether it is set, for example with getenv('MAGIC') or in the phpinfo() output. If it is, you can remove the variable wherever it is set, or unset it in your PHP code (putenv('MAGIC')) before using finfo_open():

putenv('MAGIC');
finfo_file(finfo_open(FILEINFO_MIME_TYPE), 'test.docx');
Cretic answered 1/2, 2022 at 23:56 Comment(0)
I
-1

On Apache, add this to .htaccess to fix the issue for docx and all the other file types:

AddType application/vnd.ms-word.document.macroEnabled.12 .docm
AddType application/vnd.openxmlformats-officedocument.wordprocessingml.document docx
AddType application/vnd.openxmlformats-officedocument.wordprocessingml.template dotx
AddType application/vnd.ms-powerpoint.template.macroEnabled.12 potm
AddType application/vnd.openxmlformats-officedocument.presentationml.template potx
AddType application/vnd.ms-powerpoint.addin.macroEnabled.12 ppam
AddType application/vnd.ms-powerpoint.slideshow.macroEnabled.12 ppsm
AddType application/vnd.openxmlformats-officedocument.presentationml.slideshow ppsx
AddType application/vnd.ms-powerpoint.presentation.macroEnabled.12 pptm
AddType application/vnd.openxmlformats-officedocument.presentationml.presentation pptx
AddType application/vnd.ms-excel.addin.macroEnabled.12 xlam
AddType application/vnd.ms-excel.sheet.binary.macroEnabled.12 xlsb
AddType application/vnd.ms-excel.sheet.macroEnabled.12 xlsm
AddType application/vnd.openxmlformats-officedocument.spreadsheetml.sheet xlsx
AddType application/vnd.ms-excel.template.macroEnabled.12 xltm
AddType application/vnd.openxmlformats-officedocument.spreadsheetml.template xltx
Intuition answered 20/1, 2016 at 19:3 Comment(1)
Please add some further explanation to your answer. Why should these lines in a htaccess modify the behaviour of PHP's fileinfo? How should this work if nginx is used, or a pure CLI application? - Camus
