ZIP file format. How to read file properly?

Asked 28/12, 2017 at 14:53 Answered 8/7, 2023 at 11:48

I'm currently working on a Node.js project. I want to be able to read, modify and write a ZIP file without saving it to the file system (we receive it over TCP and send it back after the modifications are made), and so far that looks possible because of the simple ZIP file structure. Currently I refer to this documentation.

So a ZIP file has a simple structure:

File header 1
File data 1
File data descriptor 1

File header 2
File data 2
File data descriptor 2

...

[other not important yet]

First we need to read the file header, which contains the compressed size field, and that would seem to be the perfect way to read file data 1 by its length. But it actually isn't. This field may contain '0' or '0xFFFFFFFF', and those values don't describe the actual length. In that case we have to read the file data without any information about its length. But how?

The compression/decompression algorithm descriptions look pretty complex to me, and I plan to use ZLIB for the compression itself anyway. So if something useful is described there, I missed the point.

Can someone explain the proper way to read those files?

P.S. Please avoid suggesting npm modules. I do not want to only solve the problem, but also to understand how things work.

Constellate answered 28/12, 2017 at 14:53 Comment(0)

Note - I'm assuming you want to read and process the zip file as it comes off the socket, rather than reading the complete zip file into memory before processing. Both options are valid.

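I'd initially ignore the use cases where the compressed size has a value of '0' or '0xFFFFFFFF'. The former is only present in zip files created in streaming mode (the real sizes are then written in the data descriptor that follows the file data), the latter in zip files that use the Zip64 extensions for sizes above 4 GiB.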

Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
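As a rough illustration in Node.js (the asker's environment), the fixed 30-byte local file header can be parsed straight from a Buffer, and the two special cases can be flagged so they can be rejected at first. This is only a sketch: the function name `parseLocalHeader` and the shape of the returned object are made up for the example, and the field offsets follow the local file header layout in APPNOTE.txt.

```javascript
// Sketch: parse the fixed part of a local file header from a Buffer.
// parseLocalHeader and its result object are hypothetical names for this example.
const LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04" read as a little-endian uint32
const LOCAL_HEADER_LENGTH = 30;            // fixed part, before the name and extra field

function parseLocalHeader(buf, offset = 0) {
  if (buf.length < offset + LOCAL_HEADER_LENGTH) {
    throw new RangeError('not enough bytes for a local file header');
  }
  if (buf.readUInt32LE(offset) !== LOCAL_HEADER_SIGNATURE) {
    throw new Error('local file header signature not found');
  }
  const flags = buf.readUInt16LE(offset + 6);
  const compressedSize = buf.readUInt32LE(offset + 18);
  const nameLength = buf.readUInt16LE(offset + 26);
  const extraLength = buf.readUInt16LE(offset + 28);
  return {
    flags,
    method: buf.readUInt16LE(offset + 8),
    crc32: buf.readUInt32LE(offset + 14),
    compressedSize,
    uncompressedSize: buf.readUInt32LE(offset + 22),
    nameLength,
    extraLength,
    // Bit 3 of the flags: sizes and CRC are written after the data (streaming mode).
    hasDataDescriptor: (flags & 0x0008) !== 0,
    // 0xFFFFFFFF: the real size lives in the Zip64 extra field.
    isZip64: compressedSize === 0xffffffff,
    // Where the (possibly compressed) file data starts.
    dataOffset: offset + LOCAL_HEADER_LENGTH + nameLength + extraLength,
  };
}
```

A first version of a reader could simply throw when `hasDataDescriptor` or `isZip64` is true and add support for those cases later.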

When the compression method is deflate (8), use zlib for compression/decompression; note that ZIP stores a raw deflate stream, without the zlib header and checksum. You also need to support the stored method (0). It gets used for very small files where compression isn't worthwhile.
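In Node.js the two methods could be handled roughly as follows. This is a minimal sketch: `decodeEntry` and `encodeEntry` are names invented for the example, while `inflateRawSync` and `deflateRawSync` are the raw-deflate functions of Node's built-in `zlib` module.

```javascript
const zlib = require('node:zlib');

const METHOD_STORED = 0;   // data is copied as-is
const METHOD_DEFLATED = 8; // raw deflate stream

// Hypothetical helper: turn the bytes stored in the archive back into file contents.
function decodeEntry(method, compressed) {
  switch (method) {
    case METHOD_STORED:
      return compressed;
    case METHOD_DEFLATED:
      return zlib.inflateRawSync(compressed);
    default:
      throw new Error(`unsupported compression method ${method}`);
  }
}

// Hypothetical helper: the reverse direction, for writing a modified entry back.
function encodeEntry(method, raw) {
  switch (method) {
    case METHOD_STORED:
      return raw;
    case METHOD_DEFLATED:
      return zlib.deflateRawSync(raw);
    default:
      throw new Error(`unsupported compression method ${method}`);
  }
}
```

The plain `inflateSync`/`deflateSync` pair expects a zlib wrapper around the stream, so it would not fit data taken directly from a ZIP entry.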

Preengage answered 29/12, 2017 at 11:10 Comment(4)

I'm so sorry that it took me so long to respond to your answer. I abandoned my ZIP project shortly after I asked my question, and only recently came back to it, did some more research, and only then noticed your answer. I have now written a zip reader for small archives, and I load the whole file into memory (rather than reading from a stream, as you mentioned), so I can read the central directory at the end of the file. I guess large files store their actual size at the end of the file as well. If this is true, you can append this info, and I will accept your answer. – Constellate 29/7, 2019 at 15:53

If you are reading the complete zip file into memory, then processing the central directory to get the data you need is the way to go. Large files (> 4 GiB) also store their actual size in the central directory, but the structure of the central directory is different. See the references to Zip64 in APPNOTE.txt for the details. – Preengage 30/7, 2019 at 10:11

My question title asks "how to read file properly". I guess the answer should mention that the proper way of reading it is from the end (the central directory), and that we can also find info about large file records there. I was too much of an amateur to understand it two years ago. So please update your answer so that other people at my level can understand it without looking into the comments, and I will gladly accept your answer. I could edit it myself and accept it anyway, but I don't really think that's the correct way from an ethical point of view. – Constellate 31/7, 2019 at 12:10

Feel free to edit the answer yourself. I have no problems with you doing that. – Preengage 1/8, 2019 at 8:9

Just spent the night resolving this. There's another block of headers at the end of the zip file: the Central Directory. These headers extend the info and, most importantly, give us the lengths of the compressed blocks.

I find those headers by their 32-bit signatures (hoping they are unique enough not to collide with any part of the zipped contents). Then I parse them, and most of the resulting array data can be used to get the uncompressed contents (this search is described in one of the @see links; I've taken it as a basis, but it only works for zips that contain the lengths in the local header blocks before the data).
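A different way to find the same records, sketched here in Node.js for the asker's environment, is to locate the End of Central Directory record at the tail of the archive and follow its offset, instead of scanning the whole file for signatures. This is not the approach taken in this answer, only an alternative under stated assumptions: a single-disk archive without Zip64, and file names decoded as UTF-8. The function names are made up for the example; the field offsets follow APPNOTE.txt.

```javascript
const EOCD_SIGNATURE = 0x06054b50;    // "PK\x05\x06"
const EOCD_MIN_LENGTH = 22;           // fixed part, without the archive comment
const CENTRAL_SIGNATURE = 0x02014b50; // "PK\x01\x02"
const CENTRAL_FIXED_LENGTH = 46;

// Scan backwards: the record sits at the very end, followed only by a comment (max 65535 bytes).
function findEndOfCentralDirectory(buf) {
  const lowest = Math.max(0, buf.length - EOCD_MIN_LENGTH - 0xffff);
  for (let pos = buf.length - EOCD_MIN_LENGTH; pos >= lowest; pos--) {
    if (buf.readUInt32LE(pos) !== EOCD_SIGNATURE) continue;
    const commentLength = buf.readUInt16LE(pos + 20);
    if (pos + EOCD_MIN_LENGTH + commentLength > buf.length) continue; // false match
    return {
      entryCount: buf.readUInt16LE(pos + 10),
      directorySize: buf.readUInt32LE(pos + 12),
      directoryOffset: buf.readUInt32LE(pos + 16),
    };
  }
  throw new Error('end of central directory record not found');
}

// Walk the central directory records one after another.
function readCentralDirectory(buf) {
  const eocd = findEndOfCentralDirectory(buf);
  const entries = [];
  let pos = eocd.directoryOffset;
  for (let i = 0; i < eocd.entryCount; i++) {
    if (buf.readUInt32LE(pos) !== CENTRAL_SIGNATURE) {
      throw new Error(`central directory record ${i} has a bad signature`);
    }
    const nameLength = buf.readUInt16LE(pos + 28);
    const extraLength = buf.readUInt16LE(pos + 30);
    const commentLength = buf.readUInt16LE(pos + 32);
    const nameStart = pos + CENTRAL_FIXED_LENGTH;
    entries.push({
      // Assumption: UTF-8 names; older archives may use CP437 instead.
      name: buf.toString('utf8', nameStart, nameStart + nameLength),
      method: buf.readUInt16LE(pos + 10),
      compressedSize: buf.readUInt32LE(pos + 20),
      uncompressedSize: buf.readUInt32LE(pos + 24),
      localHeaderOffset: buf.readUInt32LE(pos + 42),
    });
    pos = nameStart + nameLength + extraLength + commentLength;
  }
  return entries;
}
```

Because the walk starts from a known offset and knows how many records to expect, it does not depend on the signature bytes never appearing inside compressed data.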

Here is the PHP code. One note: to locate the compressed data you should use the 'nameLen' and 'extrasLen' values from the local header that sits right before each compressed data piece, because the same keys in the Central Directory record can hold different numbers and would give you wrong ranges.

Please refer to the @see links, especially the "brief intro", to understand how it works.

<?php

class ZipHelper
{
    const METHOD_STORE = 0;     // no compression
    const METHOD_DEFLATED = 8;  // main for all zips
    const METHOD_DEFLATE64 = 9; // not supported by zlib

    const LOCAL_HEAD_LENGTH = 30;
    const LOCAL_HEAD_PARAMS = "Vsig/vver/vflag/vmethod/vmodTime/vmodDate/Vcrc/VcompSize/VrawSize/vnameLen/vextrasLen";

    const CENTRAL_DIR_LENGTH = 46;
    const CENTRAL_DIR_PARAMS = "Vsig/vverMadeBy/vverToExtract/vflag/vmethod/vmodTime/vmodDate/Vcrc/VcompSize/VrawSize/".
                               "vnameLen/vextrasLen/vcommLen/vdiskNumStart/vintFileAttr/VextFileAttr/VoffsetLocalHead";

    /**
     * @see https://mcmap.net/q/527770/-extract-a-file-from-a-zip-string
     * @see https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip.html - brief intro
     * @see https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.9.TXT  - full specs
     *
     * Uses ext-zlib
     *
     * @param string $zippedContent
     * @return array<string, string|false> file name => contents, false for unsupported methods
     */
    public static function unzipOnAir(string $zippedContent): array
    {
        $result = [];
        $filesInfo = self::getZippedFilesInfoFromCentralDirectoryRecords($zippedContent);

        foreach ($filesInfo as $filename => $fileInfo) {
            $pos = $fileInfo['offsetLocalHead'];
            $head = unpack(self::LOCAL_HEAD_PARAMS, substr($zippedContent, $pos, self::LOCAL_HEAD_LENGTH));

            $pos += self::LOCAL_HEAD_LENGTH + $head['nameLen'] + $head['extrasLen']; // take from $head: it differs!
            $compressedData = substr($zippedContent, $pos, $fileInfo['compSize']);

            switch ($fileInfo['method']) {
                case self::METHOD_DEFLATED:
                    $unzipped = gzinflate($compressedData); // raw deflate stream, no zlib header
                    break;
                case self::METHOD_STORE:
                    $unzipped = $compressedData;            // stored as-is
                    break;
                case self::METHOD_DEFLATE64:
                default:
                    $unzipped = false;                      // unsupported method
            }
            $result[$filename] = $unzipped;
        }

        return $result;
    }

    /**
     * @see https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip.html - brief intro
     * @see https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.9.TXT  - full specs
     *
     * @param string $zippedContent
     * @return array[]
     */
    private static function getZippedFilesInfoFromCentralDirectoryRecords(string $zippedContent): array
    {
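        $centralDirectoryRecordSignature = pack("V", 0x02014b50);   // "PK\x01\x02", value is defined in ZIP standard
        $out = [];
        $pos = 0;
        $length = strlen($zippedContent);

        while ($pos < $length) {   // an offset past the end makes strpos() throw in PHP 8
            $pos = strpos($zippedContent, $centralDirectoryRecordSignature, $pos);
            // stop when no more records are found or the record would be truncated
            if ($pos === false || $pos + self::CENTRAL_DIR_LENGTH > $length) break;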

            $head = unpack(self::CENTRAL_DIR_PARAMS, substr($zippedContent, $pos, self::CENTRAL_DIR_LENGTH));
            $pos += self::CENTRAL_DIR_LENGTH;

            $filename = substr($zippedContent, $pos, $head['nameLen']);
            $out[$filename] = $head;

            $pos += $head['nameLen'] + $head['extrasLen'] + $head['commLen'];
        }

        return $out;
    }

}
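For the Node.js side of the question, the extraction step of `unzipOnAir` could look roughly like this. It is a sketch, not a port of the class above: `extractEntry` is a made-up name, and `entry` is assumed to be an object holding `localHeaderOffset`, `compressedSize` and `method` taken from a central directory record.

```javascript
const { inflateRawSync } = require('node:zlib');

// Hypothetical helper: extract one file, given its central directory info.
function extractEntry(buf, entry) {
  const base = entry.localHeaderOffset;
  if (buf.readUInt32LE(base) !== 0x04034b50) {
    throw new Error('local file header signature not found');
  }
  // Name and extra field lengths come from the LOCAL header: the central
  // directory may carry different values for the same entry.
  const nameLength = buf.readUInt16LE(base + 26);
  const extraLength = buf.readUInt16LE(base + 28);
  const start = base + 30 + nameLength + extraLength;

  // The size comes from the central directory, which is filled in even when
  // the local header holds 0 because the archive was written in streaming mode.
  const compressed = buf.subarray(start, start + entry.compressedSize);

  if (entry.method === 0) return compressed;                 // stored
  if (entry.method === 8) return inflateRawSync(compressed); // deflated
  throw new Error(`unsupported compression method ${entry.method}`);
}
```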

Oleaster answered 8/7, 2023 at 11:48 Comment(0)
