Is there a convenient way to map a file uri to os.path?

A subsystem which I have no control over insists on providing filesystem paths in the form of a uri. Is there a python module/function which can convert this path into the appropriate form expected by the filesystem in a platform independent manner?

Vivisect answered 12/5, 2011 at 11:56 Comment(2)
Are you going to be doing more than just reading from it?Insufferable
No, I want to pass that uri or equivalent form into the python modules for path manipulationVivisect
N
27

Use urllib.parse.urlparse to get the path from the URI:

import os
from urllib.parse import urlparse
p = urlparse('file://C:/test/doc.txt')
final_path = os.path.abspath(os.path.join(p.netloc, p.path))
Narcose answered 12/5, 2011 at 12:0 Comment(3)
@JakobBowyer - the .path in the second line should be removed otherwise you are just returning a string to variable p instead of the tuple that you need to process in the third line.Frontolysis
The valid file URI for C:\test\doc.txt is file:///C:/test/doc.txt not file://C:/test/doc.txt - see IETF RFC 8089: The "file" URI Scheme / 2. Syntax and run this in recent python 3 import pathlib; print(pathlib.PureWindowsPath("C:\\test\\doc.txt").as_uri()) so this answer is not accurate.Vanmeter
p.path should be unquoted, otherwise spaces in a filename will remain as %20. The final path should be: final_path = os.path.abspath(os.path.join(p.netloc, urllib.parse.unquote(p.path)))Biogen
A
31

The solution from @Jakob Bowyer doesn't convert URL encoded characters to regular UTF-8 characters. For that you need to use urllib.parse.unquote.

>>> from urllib.parse import unquote, urlparse
>>> unquote(urlparse('file:///home/user/some%20file.txt').path)
'/home/user/some file.txt'
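
A small check of why unquote, rather than unquote_plus, looks like the safer choice for file URIs: a plus sign is an ordinary character in a path, and only form-encoded data uses it to mean a space. The file name below is made up for illustration.

```python
from urllib.parse import unquote, unquote_plus, urlparse

# Hypothetical URI whose file name contains literal '+' signs and an encoded space.
uri = "file:///home/user/c++%20notes.txt"
path = urlparse(uri).path

decoded = unquote(path)            # decodes %20, leaves '+' untouched
decoded_plus = unquote_plus(path)  # additionally turns every '+' into a space

print(repr(decoded))
print(repr(decoded_plus))
```

With unquote the name keeps its plus signs; with unquote_plus they are silently replaced, so the resulting path would point at a different file.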
Ass answered 24/2, 2013 at 3:4 Comment(4)
@IwanAucamp can you explain why?Citify
Consider using urllib.parse.unquote_plus which is "like unquote(), but also replace plus signs with spaces".Citify
@Boris because the path returned for windows file URIs starts with a slash unquote(urlparse('file:///C:/Program Files/Steam/').path) -> '/C:/Program Files/Steam/'Vanmeter
Perfect for my linux environmentGoshawk
L
13

Of all the answers so far, I found none that catches edge cases, requires no branching, is Python 2/3 compatible, and is cross-platform.

In short, this does the job, using only the standard library:

import os

try:
    from urllib.parse import urlparse, unquote
    from urllib.request import url2pathname
except ImportError:
    # backwards compatibility (Python 2)
    from urlparse import urlparse
    from urllib import unquote, url2pathname


def uri_to_path(uri):
    parsed = urlparse(uri)
    # Prefix the path with the host (netloc) so UNC paths survive on Windows.
    host = "{0}{0}{mnt}{0}".format(os.path.sep, mnt=parsed.netloc)
    # abspath, not normpath: normpath drops a leading backslash of UNC paths.
    return os.path.abspath(
        os.path.join(host, url2pathname(unquote(parsed.path)))
    )

The tricky bit (I found) was when working in Windows with paths specifying a host. This is a non-issue outside of Windows: network locations in *NIX can only be reached via paths after being mounted to the root of the filesystem.

From Wikipedia: A file URI takes the form of file://host/path , where host is the fully qualified domain name of the system on which the path is accessible [...]. If host is omitted, it is taken to be "localhost".
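
To see where that host ends up in practice, here is a minimal sketch of what urlparse reports for a URI with and without a host. The two URIs are taken from the demo further down; the variable names are only illustrative.

```python
from urllib.parse import urlparse

# The host part of a file URI lands in .netloc; it is empty when omitted.
with_host = urlparse("file://networkstorage/homes/rdekleer")
without_host = urlparse("file:///usr/share/python3/py3versions.py")

print(repr(with_host.netloc), with_host.path)
print(repr(without_host.netloc), without_host.path)
```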

With that in mind, I make it a rule to ALWAYS prefix the path with the netloc provided by urlparse before passing it to os.path.abspath. The abspath call is necessary because it removes any resulting redundant slashes. os.path.normpath, which also claims to fix the slashes, can get a little over-zealous on Windows (it drops one of the two leading backslashes of a UNC path, as the demo output shows), hence the use of abspath.

The other crucial component in the conversion is using unquote to decode the URL percent-encoding, which your filesystem won't otherwise understand. Again, this might be a bigger issue on Windows, which allows things like $ and spaces in paths, both of which will have been encoded in the file URI.

For a demo:

import os
from pathlib import Path   # This demo requires pip install for Python < 3.4
import sys
try:
    from urllib.parse import urlparse, unquote
    from urllib.request import url2pathname
except ImportError:  # backwards compatibility:
    from urlparse import urlparse
    from urllib import unquote, url2pathname

DIVIDER = "-" * 30

if sys.platform == "win32":  # WINDOWS
    filepaths = [
        r"C:\Python27\Scripts\pip.exe",
        r"C:\yikes\paths with spaces.txt",
        r"\\localhost\c$\WINDOWS\clock.avi",
        r"\\networkstorage\homes\rdekleer",
    ]
else:  # *NIX
    filepaths = [
        os.path.expanduser("~/.profile"),
        "/usr/share/python3/py3versions.py",
    ]

for path in filepaths:
    uri = Path(path).as_uri()
    parsed = urlparse(uri)
    host = "{0}{0}{mnt}{0}".format(os.path.sep, mnt=parsed.netloc)
    normpath = os.path.normpath(
        os.path.join(host, url2pathname(unquote(parsed.path)))
    )
    absolutized = os.path.abspath(
        os.path.join(host, url2pathname(unquote(parsed.path)))
    )
    result = ("{DIVIDER}"
              "\norig path:       \t{path}"
              "\nconverted to URI:\t{uri}"
              "\nrebuilt normpath:\t{normpath}"
              "\nrebuilt abspath:\t{absolutized}").format(**locals())
    print(result)
    assert path == absolutized

Results (WINDOWS):

------------------------------
orig path:              C:\Python27\Scripts\pip.exe
converted to URI:       file:///C:/Python27/Scripts/pip.exe
rebuilt normpath:       C:\Python27\Scripts\pip.exe
rebuilt abspath:        C:\Python27\Scripts\pip.exe
------------------------------
orig path:              C:\yikes\paths with spaces.txt
converted to URI:       file:///C:/yikes/paths%20with%20spaces.txt
rebuilt normpath:       C:\yikes\paths with spaces.txt
rebuilt abspath:        C:\yikes\paths with spaces.txt
------------------------------
orig path:              \\localhost\c$\WINDOWS\clock.avi
converted to URI:       file://localhost/c%24/WINDOWS/clock.avi
rebuilt normpath:       \localhost\c$\WINDOWS\clock.avi
rebuilt abspath:        \\localhost\c$\WINDOWS\clock.avi
------------------------------
orig path:              \\networkstorage\homes\rdekleer
converted to URI:       file://networkstorage/homes/rdekleer
rebuilt normpath:       \networkstorage\homes\rdekleer
rebuilt abspath:        \\networkstorage\homes\rdekleer

Results (*NIX):

------------------------------
orig path:              /home/rdekleer/.profile
converted to URI:       file:///home/rdekleer/.profile
rebuilt normpath:       /home/rdekleer/.profile
rebuilt abspath:        /home/rdekleer/.profile
------------------------------
orig path:              /usr/share/python3/py3versions.py
converted to URI:       file:///usr/share/python3/py3versions.py
rebuilt normpath:       /usr/share/python3/py3versions.py
rebuilt abspath:        /usr/share/python3/py3versions.py
Lyssa answered 20/5, 2020 at 20:41 Comment(2)
according to the documentation url2pathname uses unquote so url2pathname(parsed.path) should be sufficientSovereign
Your solution breaks when the encoded pathname includes urlencode-like characters. E.g. the filename foo%20bar.baz will be correctly encoded by your solution to foo%2520bar.baz, but incorrectly decoded to foo bar.baz. This happens because of the unmatched unquote inside url2pathname, as pointed out by @SovereignAegeus
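
The double-decoding issue raised in these comments can be reproduced with a short sketch. It assumes a file name that literally contains the characters %20 and uses a bare file name (no directories) so the result does not depend on the platform's path separator.

```python
from urllib.parse import quote, unquote
from urllib.request import url2pathname

name = "foo%20bar.baz"   # a file name that literally contains "%20"
encoded = quote(name)    # the "%" itself is escaped as "%25"

once = url2pathname(encoded)            # url2pathname already decodes once
twice = url2pathname(unquote(encoded))  # the extra unquote decodes a second time

print(encoded)
print(once)
print(twice)
```

Decoding once round-trips the original name, while decoding twice turns the literal %20 into a space, which suggests dropping the explicit unquote call when url2pathname is used.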
V
9

To convert a file URI to a path with Python (specific to Python 3; I can write a Python 2 version if someone really wants it):

  1. Parse the uri with urllib.parse.urlparse

  2. Unquote the path component of the parsed uri with urllib.parse.unquote

  3. then ...

a. If path is a windows path and starts with /: strip the first character of unquoted path component (path component of file:///C:/some/file.txt is /C:/some/file.txt which is not interpreted to be equivalent to C:\some\file.txt by pathlib.PureWindowsPath)

b. Otherwise just use the unquoted path component as is.

Here is a function that does this:

import urllib.parse
import pathlib

def file_uri_to_path(file_uri, path_class=pathlib.PurePath):
    """
    This function returns a pathlib.PurePath object for the supplied file URI.

    :param str file_uri: The file URI ...
    :param class path_class: The type of path in the file_uri. By default it uses
        the system specific path pathlib.PurePath, to force a specific type of path
        pass pathlib.PureWindowsPath or pathlib.PurePosixPath
    :returns: the pathlib.PurePath object
    :rtype: pathlib.PurePath
    """
    windows_path = isinstance(path_class(), pathlib.PureWindowsPath)
    file_uri_parsed = urllib.parse.urlparse(file_uri)
    file_uri_path_unquoted = urllib.parse.unquote(file_uri_parsed.path)
    if windows_path and file_uri_path_unquoted.startswith("/"):
        result = path_class(file_uri_path_unquoted[1:])
    else:
        result = path_class(file_uri_path_unquoted)
    if not result.is_absolute():
        raise ValueError("Invalid file uri {} : resulting path {} not absolute".format(
            file_uri, result))
    return result

Usage examples (run on Linux):

>>> file_uri_to_path("file:///etc/hosts")
PurePosixPath('/etc/hosts')

>>> file_uri_to_path("file:///etc/hosts", pathlib.PurePosixPath)
PurePosixPath('/etc/hosts')

>>> file_uri_to_path("file:///C:/Program Files/Steam/", pathlib.PureWindowsPath)
PureWindowsPath('C:/Program Files/Steam')

>>> file_uri_to_path("file:/proc/cpuinfo", pathlib.PurePosixPath)
PurePosixPath('/proc/cpuinfo')

>>> file_uri_to_path("file:c:/system32/etc/hosts", pathlib.PureWindowsPath)
PureWindowsPath('c:/system32/etc/hosts')

This function works for Windows and POSIX file URIs, and it handles file URIs without an authority section. It does NOT, however, validate the URI's authority, so the following part of the specification is not honoured:

IETF RFC 8089: The "file" URI Scheme / 2. Syntax

The "host" is the fully qualified domain name of the system on which the file is accessible. This allows a client on another system to know that it cannot access the file system, or perhaps that it needs to use some other local mechanism to access the file.
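
If honouring the host matters, one possible approach is to check the authority before converting. The sketch below is not part of the function above; check_local_authority and its local_names parameter are hypothetical names, and treating only an empty host and "localhost" as local is an assumption.

```python
from urllib.parse import urlparse


def check_local_authority(file_uri, local_names=("", "localhost")):
    """Hypothetical helper: reject file URIs that name another host."""
    parsed = urlparse(file_uri)
    if parsed.scheme != "file":
        raise ValueError("not a file URI: {}".format(file_uri))
    host = parsed.netloc.lower()
    if host not in local_names:
        raise ValueError("file URI refers to remote host {!r}".format(host))
    # The returned path is still percent-encoded; decode it separately.
    return parsed.path


print(check_local_authority("file:///etc/hosts"))
print(check_local_authority("file://localhost/etc/hosts"))
```

A caller could run this check first and then pass the URI on to file_uri_to_path, or extend local_names with the machine's own fully qualified domain name.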

Validation (pytest) for the function:

import os
import pathlib

import pytest

# file_uri_to_path is assumed to be defined or imported in this module.

def validate(file_uri, expected_windows_path, expected_posix_path):
    if expected_windows_path is not None:
        expected_windows_path_object = pathlib.PureWindowsPath(expected_windows_path)
    if expected_posix_path is not None:
        expected_posix_path_object = pathlib.PurePosixPath(expected_posix_path)

    if expected_windows_path is not None:
        if os.name == "nt":
            assert file_uri_to_path(file_uri) == expected_windows_path_object
        assert file_uri_to_path(file_uri, pathlib.PureWindowsPath) == expected_windows_path_object

    if expected_posix_path is not None:
        if os.name != "nt":
            assert file_uri_to_path(file_uri) == expected_posix_path_object
        assert file_uri_to_path(file_uri, pathlib.PurePosixPath) == expected_posix_path_object


def test_some_paths():
    validate(pathlib.PureWindowsPath(r"C:\Windows\System32\Drivers\etc\hosts").as_uri(),
        expected_windows_path=r"C:\Windows\System32\Drivers\etc\hosts",
        expected_posix_path=r"/C:/Windows/System32/Drivers/etc/hosts")

    validate(pathlib.PurePosixPath(r"/C:/Windows/System32/Drivers/etc/hosts").as_uri(),
        expected_windows_path=r"C:\Windows\System32\Drivers\etc\hosts",
        expected_posix_path=r"/C:/Windows/System32/Drivers/etc/hosts")

    validate(pathlib.PureWindowsPath(r"C:\some dir\some file").as_uri(),
        expected_windows_path=r"C:\some dir\some file",
        expected_posix_path=r"/C:/some dir/some file")

    validate(pathlib.PurePosixPath(r"/C:/some dir/some file").as_uri(),
        expected_windows_path=r"C:\some dir\some file",
        expected_posix_path=r"/C:/some dir/some file")

def test_invalid_url():
    with pytest.raises(ValueError) as excinfo:
        validate(r"file://C:/test/doc.txt",
            expected_windows_path=r"test\doc.txt",
            expected_posix_path=r"/test/doc.txt")
    # Check the message after the with block; inside it, this line never ran.
    assert "not absolute" in str(excinfo.value)

def test_escaped():
    validate(r"file:///home/user/some%20file.txt",
        expected_windows_path=None,
        expected_posix_path=r"/home/user/some file.txt")
    validate(r"file:///C:/some%20dir/some%20file.txt",
        expected_windows_path=r"C:\some dir\some file.txt",
        expected_posix_path=r"/C:/some dir/some file.txt")

def test_no_authority():
    validate(r"file:c:/path/to/file",
        expected_windows_path=r"c:\path\to\file",
        expected_posix_path=None)
    validate(r"file:/path/to/file",
        expected_windows_path=None,
        expected_posix_path=r"/path/to/file")

This contribution is licensed (in addition to any other licenses which may apply) under the Zero-Clause BSD License (0BSD) license

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.


Public Domain

To the extent possible under law, Iwan Aucamp has waived all copyright and related or neighboring rights to this stackexchange contribution. This work is published from: Norway.

Vanmeter answered 12/8, 2019 at 14:30 Comment(0)
V
1

Starting with Python 3.13, you can use pathlib.Path.from_uri(), a new classmethod constructor that creates a pathlib.Path object from a 'file' URI (file://).

For example (on Windows):

>>> from pathlib import Path
>>> Path.from_uri('file:////server/share')
WindowsPath('//server/share')
>>> Path.from_uri('file://///server/share')
WindowsPath('//server/share')
>>> Path.from_uri('file:c:/windows')
WindowsPath('c:/windows')
>>> Path.from_uri('file:/c|/windows')
WindowsPath('c:/windows')
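
For code that must also run on interpreters older than 3.13, a guarded wrapper is one option. This is only a sketch: local_path_from_uri is a hypothetical name, and the fallback branch simply ignores the host part of the URI.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_path_from_uri(uri):
    """Hypothetical wrapper: prefer Path.from_uri() when it exists."""
    if hasattr(Path, "from_uri"):  # available from Python 3.13
        return Path.from_uri(uri)
    # Fallback for older interpreters; url2pathname also percent-decodes.
    return Path(url2pathname(urlparse(uri).path))


here = Path.cwd()
print(local_path_from_uri(here.as_uri()) == here)
```

Round-tripping the current working directory through as_uri() keeps the example independent of the operating system.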
Villegas answered 28/2, 2024 at 11:44 Comment(0)
I
0

The solution from @colton7909 is mostly correct and helped me get to this answer, but has some import errors with Python 3. That and I think this is a better way to deal with the 'file://' part of the URL than simply chopping off the first 7 characters. So I feel this is the most idiomatic way to do this using the standard library:

import urllib.parse
url_data = urllib.parse.urlparse('file:///home/user/some%20file.txt')
path = urllib.parse.unquote(url_data.path)

This example should produce the string '/home/user/some file.txt'

Infectious answered 4/5, 2019 at 22:58 Comment(0)
