Note: Starting with python 2.7.4, this is a non-issue for ZIP archives. Details at the bottom of the answer. This answer focuses on tar archives.
To figure out where a path really points to, use os.path.abspath()
(but note the caveat about symlinks as path components). If you normalize a path from your zipfile with abspath
and it does not contain the current directory as a prefix, it's pointing outside it.
But you also need to check the value of any symlink extracted from your archive (both tarfiles and unix zipfiles can store symlinks). This is important if you are worried about a proverbial "malicious user" that would intentionally bypass your security, rather than an application that simply installs itself in system libraries.
That's the aforementioned caveat: abspath
will be misled if your sandbox already contains a symlink that points to a directory. Even a symlink that points within the sandbox can be dangerous: The symlink sandbox/subdir/foo -> ..
points to sandbox
, so the path sandbox/subdir/foo/../.bashrc
should be disallowed. The easiest way to do so is to wait until the previous files have been extracted and use os.path.realpath()
. Fortunately extractall()
accepts a generator, so this is easy to do.
Since you ask for code, here's a bit that explicates the algorithm. It prohibits not only the extraction of files to locations outside the sandbox (which is what was requested), but also the creation of links inside the sandbox that point to locations outside the sandbox. I'm curious to hear if anyone can sneak any stray files or links past it.
import tarfile
from os.path import abspath, realpath, dirname, join as joinpath
from sys import stderr
resolved = lambda x: realpath(abspath(x))
def badpath(path, base):
# joinpath will ignore base if path is absolute
return not resolved(joinpath(base,path)).startswith(base)
def badlink(info, base):
# Links are interpreted relative to the directory containing the link
tip = resolved(joinpath(base, dirname(info.name)))
return badpath(info.linkname, base=tip)
def safemembers(members):
base = resolved(".")
for finfo in members:
if badpath(finfo.name, base):
print >>stderr, finfo.name, "is blocked (illegal path)"
elif finfo.issym() and badlink(finfo,base):
print >>stderr, finfo.name, "is blocked: Symlink to", finfo.linkname
elif finfo.islnk() and badlink(finfo,base):
print >>stderr, finfo.name, "is blocked: Hard link to", finfo.linkname
else:
yield finfo
ar = tarfile.open("testtar.tar")
ar.extractall(path="./sandbox", members=safemembers(ar))
ar.close()
Edit: Starting with python 2.7.4, this is a non-issue for ZIP archives: The method zipfile.extract()
prohibits the creation of files outside the sandbox:
Note: If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar
becomes foo/bar
on Unix, and C:\foo\bar
becomes foo\bar
on Windows. And all ".."
components in a member filename will be removed, e.g.: ../../foo../../ba..r
becomes foo../ba..r
. On Windows, illegal characters (:
, <
, >
, |
, "
, ?
, and *
) [are] replaced by underscore (_).
The tarfile
class has not been similarly sanitized, so the above answer still apllies.
zipfile.extract()
prohibits the creation of files outside the sandbox. So, this method is now safe as of python 2.7.4. The vulnerability still exists for tar archives, however. – Drees