Safely extract zip or tar using Python
D

5

31

I'm trying to extract user-submitted zip and tar files to a directory. The documentation for zipfile's extractall method (similarly with tarfile's extractall) states that it's possible for paths to be absolute or contain .. paths that go outside the destination path. Instead, I could use extract myself, like this:

import zipfile

some_path = '/destination/path'
some_zip = '/some/file.zip'
zipf = zipfile.ZipFile(some_zip, mode='r')
for subfile in zipf.namelist():
    zipf.extract(subfile, some_path)

Is this safe? Is it possible for a file in the archive to wind up outside of some_path in this case? If so, how can I ensure that files will never wind up outside the destination directory?

Dun answered 8/4, 2012 at 3:7 Comment(1)
Starting with Python 2.7.4, the method zipfile.extract() prohibits the creation of files outside the sandbox. So, this method is now safe as of Python 2.7.4. The vulnerability still exists for tar archives, however. - Drees
D
45

Note: Starting with Python 2.7.4, this is a non-issue for ZIP archives. Details at the bottom of the answer. This answer focuses on tar archives.

To figure out where a path really points to, use os.path.abspath() (but note the caveat about symlinks as path components). If you join a path from your archive to the directory you are extracting into, normalize it with abspath, and the result does not have that directory as a prefix, it's pointing outside it.
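A minimal sketch of that prefix test. The helper name is_inside and the /srv/sandbox directory are made up for illustration, and the check is purely lexical, so the symlink caveat still applies:

```python
import os

def is_inside(member_name, base):
    """Lexical check only: does base/member_name normalize to a path under base?"""
    base = os.path.abspath(base)
    # os.path.join discards base when member_name is absolute
    target = os.path.abspath(os.path.join(base, member_name))
    # commonpath compares whole components, so "/srv/sandbox-evil"
    # is not mistaken for something inside "/srv/sandbox"
    return os.path.commonpath([base, target]) == base

print(is_inside("docs/readme.txt", "/srv/sandbox"))      # True
print(is_inside("../../etc/passwd", "/srv/sandbox"))     # False
print(is_inside("/srv/sandbox-evil/x", "/srv/sandbox"))  # False
```

Comparing with commonpath rather than startswith avoids accepting a sibling directory whose name merely begins with the sandbox name.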

But you also need to check the value of any symlink extracted from your archive (both tarfiles and unix zipfiles can store symlinks). This is important if you are worried about a proverbial "malicious user" that would intentionally bypass your security, rather than an application that simply installs itself in system libraries.

That's the aforementioned caveat: abspath will be misled if your sandbox already contains a symlink that points to a directory. Even a symlink that points within the sandbox can be dangerous: The symlink sandbox/subdir/foo -> .. points to sandbox, so the path sandbox/subdir/foo/../.bashrc should be disallowed. The easiest way to do so is to wait until the previous files have been extracted and use os.path.realpath(). Fortunately extractall() accepts a generator, so this is easy to do.
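To make the difference concrete, here is a small sketch that recreates the sandbox/subdir/foo -> .. example in a temporary directory and resolves the same path both ways. The function name compare_resolution is hypothetical, and the example assumes a platform where os.symlink is permitted (the printed paths are shown as on a POSIX system):

```python
import os
import tempfile

def compare_resolution():
    """Return (abspath result, realpath result), both relative to the sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = os.path.realpath(os.path.join(tmp, "sandbox"))
        os.makedirs(os.path.join(sandbox, "subdir"))
        # The symlink from the text: sandbox/subdir/foo -> ..
        os.symlink("..", os.path.join(sandbox, "subdir", "foo"))

        path = os.path.join(sandbox, "subdir", "foo", "..", ".bashrc")
        lexical = os.path.abspath(path)   # collapses "foo/.." textually
        actual = os.path.realpath(path)   # follows the symlink first
        return (os.path.relpath(lexical, sandbox),
                os.path.relpath(actual, sandbox))

lexical, actual = compare_resolution()
print(lexical)  # subdir/.bashrc  (looks harmless)
print(actual)   # ../.bashrc      (really outside the sandbox)
```

abspath only rewrites the string, so it never notices that foo is a link; realpath asks the filesystem, which is why it only gives the right answer once the link has actually been extracted.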

Since you ask for code, here's a bit that explicates the algorithm. It prohibits not only the extraction of files to locations outside the sandbox (which is what was requested), but also the creation of links inside the sandbox that point to locations outside the sandbox. I'm curious to hear if anyone can sneak any stray files or links past it.

import tarfile
from os.path import abspath, realpath, dirname, join as joinpath, sep
from sys import stderr

resolved = lambda x: realpath(abspath(x))

def badpath(path, base):
    # joinpath will ignore base if path is absolute
    target = resolved(joinpath(base, path))
    # Compare whole components: a bare startswith(base) would accept "/sandbox-evil"
    return target != base and not target.startswith(base.rstrip(sep) + sep)

def badlink(info, base):
    # Symlink targets are interpreted relative to the directory containing the link
    tip = resolved(joinpath(base, dirname(info.name)))
    return badpath(info.linkname, base=tip)

def safemembers(members, base):
    # base must be the directory the archive is being extracted into
    base = resolved(base)

    for finfo in members:
        if badpath(finfo.name, base):
            print(finfo.name, "is blocked (illegal path)", file=stderr)
        elif finfo.issym() and badlink(finfo, base):
            print(finfo.name, "is blocked: Symlink to", finfo.linkname, file=stderr)
        # Hard-link targets are relative to the archive root, not to the link's directory
        elif finfo.islnk() and badpath(finfo.linkname, base):
            print(finfo.name, "is blocked: Hard link to", finfo.linkname, file=stderr)
        else:
            yield finfo

with tarfile.open("testtar.tar") as ar:
    ar.extractall(path="./sandbox", members=safemembers(ar, "./sandbox"))

Edit: Starting with Python 2.7.4, this is a non-issue for ZIP archives: The method zipfile.extract() prohibits the creation of files outside the sandbox:

Note: If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows, illegal characters (:, <, >, |, ", ?, and *) [are] replaced by underscore (_).
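The behaviour described in that note can be checked with a throwaway archive. This is only a sketch: the helper name extract_traversal_demo and the member name ../../evil.txt are invented for the test, and it assumes a Python version that includes the sanitization (2.7.4 or later, written here for Python 3):

```python
import io
import os
import tempfile
import zipfile

def extract_traversal_demo(dest):
    """Build a zip in memory with a '../' member and extract it into dest."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("../../evil.txt", "payload")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        # extract() returns the normalized path that was actually created
        return [zf.extract(name, dest) for name in zf.namelist()]

with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, "a", "b", "dest")
    os.makedirs(dest)
    created = extract_traversal_demo(dest)
    print([os.path.relpath(p, dest) for p in created])  # ['evil.txt']
```

The ".." components are dropped rather than rejected, so the file still gets written, just inside the destination.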

The tarfile class has not been similarly sanitized, so the above answer still applies.

Drees answered 9/4, 2012 at 17:44 Comment(9)
You can assume the new sandbox directory is empty - Dun
I thought as much; but you still need to watch out for the exploit I outlined: First the archive contains a symlink to another directory, then a file that uses the symlink as its path. - Drees
realpath will convert the extracted file into its real path, so you could probably just check that after extraction? - Dun
Right, you can use realpath to test every symlink immediately after extracting it (which means you can't use extractall to unzip the archive, since you need to check after extracting each file). - Drees
Thanks for this answer. I've awarded you the bounty, but I'll leave the question unanswered for now, just because this is tar-specific. The zipfile module doesn't have methods to tell whether a file is a symlink. - Dun
This really should be supported by default :/ - Aspa
According to the readme, Archive.extract() will raise an exception if it detects an out-of-bounds file. The exception will terminate the bulk extraction, and there's no way to resume it. There doesn't even seem to be a way to list the archive contents and extract one file at a time. Color me unimpressed. - Drees
I think the print messages for symlinks and hard links should be swapped. - Marutani
Oops, good catch @Alb - Drees
P
4

Contrary to the popular answer, unzipping files safely is not completely solved as of Python 2.7.4. The extractall method is still dangerous and can lead to path traversal, either directly or through the unzipping of symbolic links. Here is my final solution, which should prevent both attacks in all versions of Python, even versions prior to Python 2.7.4 where the extract method was vulnerable:

import zipfile, os

def safe_unzip(zip_file, extract_path='.'):
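    with zipfile.ZipFile(zip_file, 'r') as zf:
        base = os.path.realpath(extract_path)
        for member in zf.infolist():
            file_path = os.path.realpath(os.path.join(extract_path, member.filename))
            # Match whole path components, so "/dest-evil" is not accepted for "/dest"
            if file_path == base or file_path.startswith(base.rstrip(os.sep) + os.sep):
                zf.extract(member, extract_path)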

Edit 1: Fixed variable name clash. Thanks Juuso Ohtonen.

Edit 2: s/abspath/realpath/g. Thanks TheLizzard

Psychosocial answered 12/4, 2016 at 20:53 Comment(4)
Avoid using zipfile as a parameter name, since it shadows the imported module: AttributeError: 'str' object has no attribute 'ZipFile'. The fix is to rename the zipfile parameter to e.g. zip_file. - Fictive
Thanks for the comment. I fixed the sample code. I originally grabbed it out of my project and edited it to be stand-alone and clearly forgot to test it. - Psychosocial
Why are you using os.path.abspath rather than os.path.realpath? Wouldn't it be safer to use os.path.realpath? - Metropolitan
Good point. I'll update the answer to reflect that suggestion. realpath apparently calls abspath, so realpath should be sufficient. - Psychosocial
L
3

Use ZipFile.infolist()/TarFile.next()/TarFile.getmembers() to get the information about each entry in the archive, normalize the path, open the file yourself, use ZipFile.open()/TarFile.extractfile() to get a file-like for the entry, and copy the entry data yourself.
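For the zip case, those steps might look roughly like this. It is a sketch, not a vetted implementation: the function name copy_members is invented, and silently skipping out-of-bounds entries is just one possible policy (raising an error would be equally reasonable):

```python
import io
import os
import shutil
import tempfile
import zipfile

def copy_members(zf, dest):
    """Copy each regular entry by hand; skip anything that would leave dest."""
    dest = os.path.realpath(dest)
    written = []
    for info in zf.infolist():
        if info.is_dir():
            continue
        # Normalize the stored name against the destination
        target = os.path.realpath(os.path.join(dest, info.filename))
        if target == dest or os.path.commonpath([dest, target]) != dest:
            continue  # policy choice: skip entries that point outside dest
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # Open the output file ourselves and copy the entry's data into it
        with zf.open(info) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

Because the output file is opened directly, nothing in the archive (permissions, link entries, stored paths) is acted on except the bytes of each entry.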

Latoyalatoye answered 8/4, 2012 at 3:19 Comment(6)
This seems really tricky to make sure I get right - especially if you have files like ../../../../subdir/../../something/file.txt - where should the destination be? No one has made code available to deal with this before? - Dun
No one can answer that for you, since only you understand your application requirements. - Latoyalatoye
I disagree. Other tools do this automatically for you - for example the tar command automatically gets rid of absolute paths unless you specify --absolute-names. - Dun
And any software that delegates to tar has to abide by that. This is your software. - Latoyalatoye
sigh When you come across an entry with an invalid/disallowed path you have 3 options: 1) attempt extraction anyway, and catch any errors 2) extract to a modified path 3) don't extract. I can't tell you which policy is appropriate for your application. - Latoyalatoye
@IgnacioVazquez-Abrams: Sure but why doesn't Python give you those options? It clearly could. And why is the default option pretty clearly the absolute worst? - Biology
C
3

Copy the zipfile to an empty directory. Then use os.chroot to make that directory the root directory. Then unzip there.

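Alternatively, you can call unzip itself with the -j ("junk paths") flag, which discards the directory structure stored in the archive and puts every file directly into the extraction directory: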

import subprocess
filename = '/some/file.zip'
rv = subprocess.call(['unzip', '-j', filename])
Coburg answered 15/4, 2012 at 11:57 Comment(4)
The subprocess module works on every platform that runs Python, AFAICT. But if you are talking about MS Windows, there are several programs for handling zipfiles available for it, like Info-ZIP. The specific command line would of course need to be adapted for the program you wish to use. - Coburg
You're right, os.chroot is specific to UNIX. But if you search for them you'll find chroot-like applications for Windows. Of course the real overkill solution in this case would be to run unzip in a virtual machine. :-) - Coburg
That's a brilliantly simple idea, but (a) it only really works on Unix systems, and (b) on Unix, only the superuser can chroot. Privilege escalation in the midst of dealing with potentially unsafe data is really the wrong way to go... - Drees
Using the -j flag of Info-ZIP's unzip as an alternative to chroot should work on any platform that unzip works on. - Coburg
A
2

PSA: The accepted answer to this question is out of date!

As of Python release 3.11.4 there is an extraction filter mechanism included in tarfile.TarFile.extractall(). This mechanism, when using the data filter, will ensure safe extraction of tarballs in most cases (including CVE-2007-4559).

If you have the ability, you should use Python >= 3.11.4 when processing untrusted tar files so as to avail yourself of the provided security features. The accepted answer should be implemented iff you can't use the language feature for this purpose.

Fare thee well, fellow exhausted engineer...

Aranda answered 12/9, 2023 at 18:3 Comment(0)