Python: Get URL path sections

Asked 25/10, 2011 at 18:57 Answered 14/5, 2019 at 14:50

How do I get specific path sections from a url? For example, I want a function which operates on this:

http://www.mydomain.com/hithere?image=2934

and returns "hithere"

or operates on this:

http://www.mydomain.com/hithere/something/else

and returns the same thing ("hithere")

I know this will probably use urllib or urllib2 but I can't figure out from the docs how to get only a section of the path.

Ballenger answered 25/10, 2011 at 18:57 Comment(3)

The URL syntax is something like scheme://domain:port/path?query_string#fragment_id, so 'hithere' is the whole path in the first case and one section of it in the second. Just urlparse it; then 'hithere' is going to be path.split('/')[1] – Virtuoso 25/10, 2011 at 19:25

wouldn't it be path.split('/')[0]? (the first item of the list) – Ballenger 25/10, 2011 at 19:28

No, because the path starts with a '/' so [0] is an empty string. I.e. ideone.com/hJRxk – Virtuoso 25/10, 2011 at 19:33
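
The exchange above can be condensed into a short Python 3 sketch. The helper name first_path_section is hypothetical, and returning an empty string when the URL has no path is an assumption rather than something the thread specifies.

```python
from urllib.parse import urlparse


def first_path_section(url):
    """Return the first section of the URL path, or '' if there is none."""
    path = urlparse(url).path           # e.g. '/hithere/something/else'
    parts = path.split('/')             # ['', 'hithere', 'something', 'else']
    # parts[0] is '' because the path starts with '/', so index 1 is the
    # first real section.
    return parts[1] if len(parts) > 1 else ''


print(first_path_section('http://www.mydomain.com/hithere?image=2934'))      # hithere
print(first_path_section('http://www.mydomain.com/hithere/something/else'))  # hithere
```

The query string never reaches split() because urlparse stores it separately in the query attribute.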

Extract the path component of the URL with urlparse (Python 2.7):

import urlparse
path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
print path
> /hithere/something/else

or urllib.parse (Python 3):

import urllib.parse
path = urllib.parse.urlparse('http://www.example.com/hithere/something/else').path
print(path)
> /hithere/something/else

Split the path into components with os.path.split:

>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')

The dirname and basename functions give you the two pieces of the split; perhaps use dirname in a while loop:

>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
... 
>>> path
'/hithere'

Circumrotate answered 25/10, 2011 at 19:6 Comment(9)

Does urllib not have any function that can do this without doing a bunch of string parsing/splitting/looping? I thought there'd be a shortcut... – Ballenger 25/10, 2011 at 19:19

how does urlparse.parse compare with cgi.parse? – Ballenger 25/10, 2011 at 19:58

urlparse.parse didn't work (at least on Python 2.7). You should use urlparse.urlparse – Apiary 20/3, 2013 at 17:48

Don't use os.path.split for urls as it is platform dependent. That code will fail on Windows because it expects \ as a delimiter! – Switchback 29/6, 2013 at 5:25

@Switchback This is incorrect. I just tested. It would be wrong to use os.path.join since it would use the wrong delimiter, but the split method can still split on /. In fact, you can type all your directory paths for Windows using / as the directory separator in Python. Using / as the directory separator works in a lot of places on Windows, not just in Python. – Grantinaid 29/6, 2013 at 6:14

os.path.split may happen to work but I think it would be bad practice to use it here, as it is clearly intended for os paths and not url paths. – Acicula 3/5, 2014 at 20:6

using os.path will fail for URLs containing \ on Windows. Use posixpath instead - see my answer. – Gresham 26/8, 2014 at 1:56

Why not just use path.split("/")? – Satire 30/3, 2016 at 9:13

filter(bool, path.split("/")), in case there is a trailing "/" char – Satire 30/3, 2016 at 9:21
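
A Python 3 sketch of the plain split suggested in the last two comments. In Python 3, filter returns a lazy iterator, so a list comprehension is used here to get a list directly; the function name path_sections is hypothetical.

```python
from urllib.parse import urlparse


def path_sections(url):
    """Split the URL path on '/' and drop empty sections."""
    path = urlparse(url).path
    # Empty strings come from the leading '/', a trailing '/', or '//'.
    return [section for section in path.split('/') if section]


print(path_sections('http://www.example.com/hithere/something/else/'))
# ['hithere', 'something', 'else']
```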

Python 3.4+ solution:

from urllib.parse import unquote, urlparse
from pathlib import PurePosixPath

url = 'http://www.example.com/hithere/something/else'

PurePosixPath(
    unquote(
        urlparse(
            url
        ).path
    )
).parts[1]

# returns 'hithere' (the same for the URL with parameters)

# parts holds ('/', 'hithere', 'something', 'else')
#               0    1          2            3
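
The expression above raises IndexError when the path is empty or just '/', since parts then has fewer than two items. A guarded variant might look like the following sketch; the name first_part and the choice to return None are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def first_part(url):
    """Return the first path section via PurePosixPath, or None if absent."""
    parts = PurePosixPath(unquote(urlparse(url).path)).parts
    # For an absolute path, parts[0] is the root, so the first section
    # is parts[1].
    return parts[1] if len(parts) > 1 else None


print(first_part('http://www.example.com/hithere?image=2934'))   # hithere
print(first_part('http://www.example.com/'))                      # None
```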

Adapter answered 31/10, 2018 at 4:5 Comment(0)

The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and consistently operates on POSIX paths when used on POSIX and Windows NT based platforms.

Sample Code:

#!/usr/bin/env python3

import ntpath
import posixpath
import sys
import urllib.parse

def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    while tmp != "/":
        ( tmp, item ) = module.split( tmp )
        result.insert( 0, item )
    return result

def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n  --[n={},m={}]-->\n    {}\n".format( 
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )

test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )

Code output:

http://eg.com/hithere/something/else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=False,m=posixpath]-->
    [ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=posixpath]-->
    [ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=ntpath]-->
    [ "see", "if", "this", "works" ]

Notes:

On Windows NT based platforms, os.path is ntpath.
On Unix/POSIX based platforms, os.path is posixpath.
ntpath treats a backslash (\) as a path separator, so it will not handle backslashes in URL paths correctly (see the last two cases in the code/output), which is why posixpath is recommended.
Remember to use urllib.parse.unquote.
Consider using posixpath.normpath.
The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath.normpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same), except that exactly two leading slashes are preserved.
Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
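
The notes above can be checked with a few direct calls. This is only an illustrative sketch; the example URL is made up.

```python
import ntpath
import posixpath
from urllib.parse import unquote, urlparse

url = 'http://eg.com/a//b/./c/../d%20e/'
path = unquote(urlparse(url).path)      # '/a//b/./c/../d e/'

# normpath collapses '//', resolves '.' and '..', and drops the trailing '/'.
print(posixpath.normpath(path))         # /a/b/d e

# A decoded backslash is ordinary data for posixpath but a separator for ntpath.
print(posixpath.split('/see\\'))        # ('/', 'see\\')
print(ntpath.split('/see\\'))           # ('/see', '')
```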


Gresham answered 26/8, 2014 at 0:33 Comment(3)

Python 3.4+ solution: url_path = PurePosixPath(urllib.parse.unquote(urllib.parse.urlparse(url).path)). – Adapter 30/1, 2018 at 8:29

@Adapter worthwhile to post this as an answer – Parette 27/10, 2018 at 21:20

Great answer. However, this fails if there is an error in one of the scraped URLs. For example, test_url( "http://eg.com/hithere//something/else" ) will lead to an infinite loop on while tmp != "/": – Watterson 11/11, 2019 at 21:53
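
One way to guard against paths that never reduce to "/" (for example a path that still begins with "//" after normalization, or an empty path) is to stop as soon as split() no longer shortens the path. This is a sketch and not the answer's original code; the name path_parse_safe is hypothetical.

```python
import posixpath


def path_parse_safe(path_string, *, normalize=True, module=posixpath):
    """Split a path into sections, stopping when split() makes no progress."""
    tmp = module.normpath(path_string) if normalize else path_string
    result = []
    while True:
        head, item = module.split(tmp)
        if head == tmp:
            # Nothing left to strip, e.g. tmp is '/', '//' or ''.
            break
        result.insert(0, item)
        tmp = head
    return result


print(path_parse_safe('/hithere//something/else'))   # ['hithere', 'something', 'else']
print(path_parse_safe('//hithere'))                   # ['hithere']
```

For the inputs in the answer's test list, this variant is intended to give the same sections as the original loop.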

Note that in Python 3 the import has changed to from urllib.parse import urlparse (see the documentation). Here is an example:

>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
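
Applied to the URL from the question, the same attributes keep the query string out of the path. A brief sketch; parse_qs is included only to show where the image parameter ends up.

```python
from urllib.parse import parse_qs, urlparse

p = urlparse('http://www.mydomain.com/hithere?image=2934')

print(p.path)               # /hithere
print(p.query)              # image=2934
print(parse_qs(p.query))    # {'image': ['2934']}
```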

Camelopardus answered 8/12, 2018 at 15:31 Comment(0)

import urlparse

output = urlparse.urlparse('http://www.example.com/temp/something/happen/index.html').path

output

'/temp/something/happen/index.html'

Split the path with the string's built-in rpartition method:

output.rpartition('/')[0]

'/temp/something/happen'

Chorion answered 24/8, 2016 at 10:0 Comment(0)

Here is an example using urlparse and rpartition.

try:
    # Python 3.x
    from urllib.parse import urlparse
except ImportError:
    # Python 2.x
    from urlparse import urlparse

def printPathTokens(full_url):
    print('printPathTokens() called: %s' % full_url)

    p_full = urlparse(full_url).path

    print(' . p_full url: %s' % p_full)

    # Split the path using the rpartition method of str.
    # rpartition returns a 3-tuple: the part before the last separator,
    # the separator itself, and the part after the separator.
    (rp_left, rp_match, rp_right) = p_full.rpartition('/')

    if rp_match == '': # rp_match is '' when no separator was found
        print(' . No slashes found in path')
    else:
        print(' . path to last resource: %s' % rp_left)
        if rp_right == '': # Ended with a slash
            print(' . last resource: (none)')
        else:
            print(' . last resource: %s' % (rp_right))


printPathTokens('http://www.example.com/temp/something/happen/index.html')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen/index.html
# . p_full url: /temp/something/happen/index.html
# . path to last resource: /temp/something/happen
# . last resource: index.html

printPathTokens('http://www.example.com/temp/something/happen/')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen/
# . p_full url: /temp/something/happen/
# . path to last resource: /temp/something/happen
# . last resource: (none)

printPathTokens('http://www.example.com/temp/something/happen')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen
# . p_full url: /temp/something/happen
# . path to last resource: /temp/something
# . last resource: happen
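
If the two pieces are needed as values rather than printed, the same rpartition call can be wrapped in a small function. A Python 3 sketch; the name split_last_resource is hypothetical.

```python
from urllib.parse import urlparse


def split_last_resource(url):
    """Return (path_to_last_resource, last_resource) for a URL."""
    parent, _sep, last = urlparse(url).path.rpartition('/')
    return parent, last


print(split_last_resource('http://www.example.com/temp/something/happen/index.html'))
# ('/temp/something/happen', 'index.html')
print(split_last_resource('http://www.example.com/temp/something/happen/'))
# ('/temp/something/happen', '')
```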

Janitajanith answered 14/5, 2019 at 14:50 Comment(0)

A combination of urlparse and os.path.split will do the trick. The following script stores all sections of a URL in a list, in reverse order.

import os.path, urlparse

def generate_sections_of_url(url):
    path = urlparse.urlparse(url).path
    sections = []
    while path != '/':
        temp = os.path.split(path)
        path = temp[0]
        sections.append(temp[1])
    return sections

For http://www.mydomain.com/hithere/something/else, this would return: ["else", "something", "hithere"]

Softwood answered 29/1, 2016 at 15:53 Comment(0)
