C#.net identify zip file
Asked Answered
J

6

20

I am currently using the SharpZip api to handle my zip file entries. It works splendid for zipping and unzipping. Though, I am having trouble identifying if a file is a zip or not. I need to know if there is a way to detect if a file stream can be decompressed. Originally I used

FileStream lFileStreamIn = File.OpenRead(mSourceFile);
lZipFile = new ZipFile(lFileStreamIn);
ZipInputStream lZipStreamTester = new ZipInputStream(lFileStreamIn, mBufferSize);// not working
lZipStreamTester.Read(lBuffer, 0, 0);
if (lZipStreamTester.CanDecompressEntry)
{

The LZipStreamTester becomes null every time and the if statement fails. I tried it with/without a buffer. Can anybody give any insight as to why? I am aware that i can check for file extension. I need something that is more definitive than that. I am also aware that zip has a magic #(PK something), but it isn't a guarantee that it will always be there because it isn't a requirement of the format.

Also i read about .net 4.5 having native zip support so my project may migrate to that instead of sharpzip but I still need didn't see a method/param similar to CanDecompressEntry here: http://msdn.microsoft.com/en-us/library/3z72378a%28v=vs.110%29

My last resort will be to use a try catch and attempt an unzip on the file.

Jus answered 16/8, 2012 at 22:23 Comment(2)
The simplest form of my question is this "In the code above, why does the if statement return false?"Jus
Magic IS mandatory per PKWARE specification.Homochromatic
G
21

This is a base class for a component that needs to handle data that is either uncompressed, PKZIP compressed (sharpziplib) or GZip compressed (built in .net). Perhaps a bit more than you need but should get you going. This is an example of using @PhonicUK's suggestion to parse the header of the data stream. The derived classes you see in the little factory method handled the specifics of PKZip and GZip decompression.

abstract class Expander
{
    private const int ZIP_LEAD_BYTES = 0x04034b50;
    private const ushort GZIP_LEAD_BYTES = 0x8b1f;

    public abstract MemoryStream Expand(Stream stream); 
    
    internal static bool IsPkZipCompressedData(byte[] data)
    {
        Debug.Assert(data != null && data.Length >= 4);
        // if the first 4 bytes of the array are the ZIP signature then it is compressed data
        return (BitConverter.ToInt32(data, 0) == ZIP_LEAD_BYTES);
    }

    internal static bool IsGZipCompressedData(byte[] data)
    {
        Debug.Assert(data != null && data.Length >= 2);
        // if the first 2 bytes of the array are theG ZIP signature then it is compressed data;
        return (BitConverter.ToUInt16(data, 0) == GZIP_LEAD_BYTES);
    }

    public static bool IsCompressedData(byte[] data)
    {
        return IsPkZipCompressedData(data) || IsGZipCompressedData(data);
    }

    public static Expander GetExpander(Stream stream)
    {
        Debug.Assert(stream != null);
        Debug.Assert(stream.CanSeek);
        stream.Seek(0, 0);

        try
        {
            byte[] bytes = new byte[4];

            stream.Read(bytes, 0, 4);

            if (IsGZipCompressedData(bytes))
                return new GZipExpander();

            if (IsPkZipCompressedData(bytes))
                return new ZipExpander();

            return new NullExpander();
        }
        finally
        {
            stream.Seek(0, 0);  // set the stream back to the begining
        }
    }
}
Gorga answered 16/8, 2012 at 22:39 Comment(3)
This is helpful but from the research I have done the PK file header or magic number is not a reliable way of determining if a file is zip. Thank you though.Jus
Haven't had trouble with that but this is from a system where the sources of the compressed data are well understood and controlled. Good luck!Gorga
I am going to have to do an audit on our file system to be sure. I believe the mix of a PK magic number check, file extension, and try catch on unzip will be enough. We originally wanted to avoid using a try catch to determine if the file was a zip but it has to be in there. Even if we assume a zip on magic number we still need to try catch to determine if the zip is corrupted. I wish I could rep you but too noob right now. We also have reworked how we will upload files to remove some of the ambiguity. Thanks again.Jus
S
13

View https://mcmap.net/q/613007/-c-net-identify-zip-file reference

Check the below links:

icsharpcode-sharpziplib-validate-zip-file

How-to-check-if-a-file-is-compressed-in-c#

ZIP files always start with 0x04034b50 (4 bytes)
View more: http://en.wikipedia.org/wiki/Zip_(file_format)#File_headers

Sample usage:

        bool isPKZip = IOHelper.CheckSignature(pkg, 4, IOHelper.SignatureZip);
        Assert.IsTrue(isPKZip, "Not ZIP the package : " + pkg);

// http://blog.somecreativity.com/2008/04/08/how-to-check-if-a-file-is-compressed-in-c/
    public static partial class IOHelper
    {
        public const string SignatureGzip = "1F-8B-08";
        public const string SignatureZip = "50-4B-03-04";

        public static bool CheckSignature(string filepath, int signatureSize, string expectedSignature)
        {
            if (String.IsNullOrEmpty(filepath)) throw new ArgumentException("Must specify a filepath");
            if (String.IsNullOrEmpty(expectedSignature)) throw new ArgumentException("Must specify a value for the expected file signature");
            using (FileStream fs = new FileStream(filepath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                if (fs.Length < signatureSize)
                    return false;
                byte[] signature = new byte[signatureSize];
                int bytesRequired = signatureSize;
                int index = 0;
                while (bytesRequired > 0)
                {
                    int bytesRead = fs.Read(signature, index, bytesRequired);
                    bytesRequired -= bytesRead;
                    index += bytesRead;
                }
                string actualSignature = BitConverter.ToString(signature);
                if (actualSignature == expectedSignature) return true;
                return false;
            }
        }

    }
Somebody answered 19/5, 2014 at 12:38 Comment(0)
P
9

You can either:

  • Use a try-catch structure and try to read the structure of a potential zip file
  • Parse the file header to see if it is a zip file

ZIP files always start with 0x04034b50 as its first 4 bytes ( http://en.wikipedia.org/wiki/Zip_(file_format)#File_headers )

Particiaparticipant answered 16/8, 2012 at 22:26 Comment(3)
Wikipedia and other sources have listed that the magic number is not a good way to determine if a file is a zip because it is not required for the file format of a zip. Alternatively, we would like to avoid try catching but that is the current method.Jus
Note: 0x04034B50 is little endian, so the first byte of the file is 0x50, the second one is 0x4B and so on...Avifauna
@Sean Dunford. No. Magic number IS the good way to check if a file is a zip. If the magic is present there are good chances the file is really a zip. If the magic is gone then the file should not be considered as a zip because magic IS mandatory. If it still parse as a zip despite missing header then author tried to fool you.Homochromatic
O
3

I used https://en.wikipedia.org/wiki/List_of_file_signatures, just adding an extra byte on for my zip files, to differentiate between my zip files and Word documents (these share the first four bytes).

Here is my code:

public class ZipFileUtilities
{
    private static readonly byte[] ZipBytes1 = { 0x50, 0x4b, 0x03, 0x04, 0x0a };
    private static readonly byte[] GzipBytes = { 0x1f, 0x8b };
    private static readonly byte[] TarBytes = { 0x1f, 0x9d };
    private static readonly byte[] LzhBytes = { 0x1f, 0xa0 };
    private static readonly byte[] Bzip2Bytes = { 0x42, 0x5a, 0x68 };
    private static readonly byte[] LzipBytes = { 0x4c, 0x5a, 0x49, 0x50 };
    private static readonly byte[] ZipBytes2 = { 0x50, 0x4b, 0x05, 0x06 };
    private static readonly byte[] ZipBytes3 = { 0x50, 0x4b, 0x07, 0x08 };

    public static byte[] GetFirstBytes(string filepath, int length)
    {
        using (var sr = new StreamReader(filepath))
        {
            sr.BaseStream.Seek(0, 0);
            var bytes = new byte[length];
            sr.BaseStream.Read(bytes, 0, length);

            return bytes;
        }
    }

    public static bool IsZipFile(string filepath)
    {
        return IsCompressedData(GetFirstBytes(filepath, 5));
    }

    public static bool IsCompressedData(byte[] data)
    {
        foreach (var headerBytes in new[] { ZipBytes1, ZipBytes2, ZipBytes3, GzipBytes, TarBytes, LzhBytes, Bzip2Bytes, LzipBytes })
        {
            if (HeaderBytesMatch(headerBytes, data))
                return true;
        }

        return false;
    }

    private static bool HeaderBytesMatch(byte[] headerBytes, byte[] dataBytes)
    {
        if (dataBytes.Length < headerBytes.Length)
            throw new ArgumentOutOfRangeException(nameof(dataBytes), 
                $"Passed databytes length ({dataBytes.Length}) is shorter than the headerbytes ({headerBytes.Length})");

        for (var i = 0; i < headerBytes.Length; i++)
        {
            if (headerBytes[i] == dataBytes[i]) continue;

            return false;
        }

        return true;
    }

 }

There may be better ways to code this particularly the byte compare, but as its a variable length byte compare (depending on the signature being checked), I felt at least this code is readable - to me at least.

Ofelia answered 11/3, 2019 at 13:58 Comment(0)
L
2

If you are programming for Web, you can check the file Content Type: application/zip

Luciferase answered 16/8, 2012 at 22:36 Comment(0)
M
1

Thanks to dkackman and Kiquenet for answers above. For completeness, the below code uses the signature to identify compressed (zip) files. You then have the added complexity that the newer MS Office file formats will also return match this signature lookup (your .docx and .xlsx files etc). As remarked upon elsewhere, these are indeed compressed archives, you can rename the files with a .zip extension and have a look at the XML inside.

Below code, first does a check for ZIP (compressed) using the signatures used above, and we then have a subsequent check for the MS Office packages. Note that to use the System.IO.Packaging.Package you need a project reference to "WindowsBase" (that is a .NET assembly reference).

    private const string SignatureZip = "50-4B-03-04";
    private const string SignatureGzip = "1F-8B-08";

    public static bool IsZip(this Stream stream)
    {
        if (stream.Position > 0)
        {
            stream.Seek(0, SeekOrigin.Begin);
        }

        bool isZip = CheckSignature(stream, 4, SignatureZip);
        bool isGzip = CheckSignature(stream, 3, SignatureGzip);

        bool isSomeKindOfZip = isZip || isGzip;

        if (isSomeKindOfZip && stream.IsPackage()) //Signature matches ZIP, but it's package format (docx etc).
        {
            return false;
        }

        return isSomeKindOfZip;
    }

    /// <summary>
    /// MS .docx, .xslx and other extensions are (correctly) identified as zip files using signature lookup.
    /// This tests if System.IO.Packaging is able to open, and if package has parts, this is not a zip file.
    /// </summary>
    /// <param name="stream"></param>
    /// <returns></returns>
    private static bool IsPackage(this Stream stream)
    {
        Package package = Package.Open(stream, FileMode.Open, FileAccess.Read);
        return package.GetParts().Any();
    }
Merely answered 16/6, 2018 at 10:52 Comment(1)
doesn't this fail when a file is gzipped at Package.Open()?Transonic

© 2022 - 2024 — McMap. All rights reserved.