Python: Get URL path sections
B

7

69

How do I get specific path sections from a url? For example, I want a function which operates on this:

http://www.mydomain.com/hithere?image=2934

and returns "hithere"

or operates on this:

http://www.mydomain.com/hithere/something/else

and returns the same thing ("hithere")

I know this will probably use urllib or urllib2 but I can't figure out from the docs how to get only a section of the path.

Ballenger answered 25/10, 2011 at 18:57 Comment(3)
The URL syntax is something like scheme://domain:port/path?query_string#fragment_id, so 'hithere' is the whole path in the first case and one section of it in the second. Just urlparse the URL; 'hithere' is then path.split('/')[1]. - Virtuoso
Wouldn't it be path.split('/')[0] (the first item of the list)? - Ballenger
No, because the path starts with a '/', so [0] is an empty string. See ideone.com/hJRxk - Virtuoso
C
69

Extract the path component of the URL with urlparse (Python 2.7):

import urlparse
path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
print path
> /hithere/something/else

or urllib.parse (Python 3):

import urllib.parse
path = urllib.parse.urlparse('http://www.example.com/hithere/something/else').path
print(path)
> /hithere/something/else

Split the path into components with os.path.split:

>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')

The dirname and basename functions give you the two pieces of the split; perhaps use dirname in a while loop:

>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
... 
>>> path
'/hithere'
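
Note that the loop above never terminates when the path is empty (for example for 'http://www.example.com', where dirname('') stays ''). A minimal sketch of a guarded variant follows. The function name first_section is hypothetical, and posixpath is swapped in for os.path so that the behaviour does not depend on the platform:

```python
import posixpath
from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse


def first_section(url):
    """Return the first path section of url, or None if there is none.

    Hypothetical helper for illustration, not a standard-library function.
    """
    path = urlparse(url).path
    while True:
        parent = posixpath.dirname(path)
        # Stop at the top-level section, or when dirname makes no progress
        # (empty or relative paths), so the loop always terminates.
        if parent in ('/', '', path):
            break
        path = parent
    return posixpath.basename(path) or None
```

With this guard, first_section('http://www.example.com') returns None instead of hanging.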
Circumrotate answered 25/10, 2011 at 19:6 Comment(9)
Does urllib not have any function that can do this without doing a bunch of string parsing/splitting/looping? I thought there'd be a shortcut... - Ballenger
How does urlparse.parse compare with cgi.parse? - Ballenger
urlparse.parse didn't work (at least on Python 2.7). You should use urlparse.urlparse. - Apiary
Don't use os.path.split for URLs, as it is platform dependent. That code will fail on Windows because it expects \ as a delimiter! - Switchback
@Switchback This is incorrect. I just tested. It would be wrong to use os.path.join since it would use the wrong delimiter, but the split method can still split on /. In fact, you can type all your directory paths for Windows using / as the directory separator in Python. Using / as the directory separator works in a lot of places on Windows, not just in Python. - Grantinaid
os.path.split may happen to work, but I think it would be bad practice to use it here, as it is clearly intended for OS paths and not URL paths. - Acicula
Using os.path will fail for URLs containing \ on Windows. Use posixpath instead; see my answer. - Gresham
Why not just use path.split("/")? - Satire
filter(bool, path.split("/")), in case there is a trailing "/" char. - Satire
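
The plain split suggested in the last two comments can be sketched as follows. This is only an illustration; the function name split_sections is hypothetical. A list comprehension is used instead of filter() because filter() returns an iterator rather than a list on Python 3:

```python
from urllib.parse import urlparse


def split_sections(url):
    """Split the URL path on '/' and drop empty strings (hypothetical helper)."""
    path = urlparse(url).path
    # A leading, trailing, or doubled '/' produces empty strings; drop them.
    return [section for section in path.split('/') if section]
```

Dropping empty strings also hides doubled slashes such as '/a//b', which may or may not be what you want.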
A
54

Python 3.4+ solution:

from urllib.parse import unquote, urlparse
from pathlib import PurePosixPath

url = 'http://www.example.com/hithere/something/else'

PurePosixPath(
    unquote(
        urlparse(
            url
        ).path
    )
).parts[1]

# returns 'hithere' (the same for the URL with parameters)

# parts holds ('/', 'hithere', 'something', 'else')
#               0    1          2            3
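
Indexing .parts[1] directly raises IndexError when the URL has no path sections (for example 'http://www.example.com/'). A minimal sketch of a wrapper that returns a list instead, assuming a hypothetical helper name path_sections:

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def path_sections(url):
    """Return the sections of the URL path as a list (hypothetical helper)."""
    path = PurePosixPath(unquote(urlparse(url).path))
    parts = path.parts
    # For an absolute path, parts[0] is the root ('/'); skip it.
    return list(parts[1:] if path.is_absolute() else parts)
```

The caller can then check whether the list is empty before taking element 0. Because unquote runs before the split, an encoded slash (%2F) inside a section is treated as a separator.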

Adapter answered 31/10, 2018 at 4:5 Comment(0)
G
26

The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and consistently operates on POSIX paths when used on POSIX and Windows NT based platforms.


Sample Code:

#!/usr/bin/env python3

import urllib.parse
import sys
import posixpath
import ntpath
import json

def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    while tmp != "/":
        ( tmp, item ) = module.split( tmp )
        result.insert( 0, item )
    return result

def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n  --[n={},m={}]-->\n    {}\n".format( 
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )

test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )

Code output:

http://eg.com/hithere/something/else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
  --[n=False,m=posixpath]-->
    [ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
  --[n=True,m=posixpath]-->
    [ "else" ]
http://eg.com/hithere/../../else
  --[n=False,m=posixpath]-->
    [ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=True,m=posixpath]-->
    [ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
  --[n=False,m=posixpath]-->
    [ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=posixpath]-->
    [ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
  --[n=False,m=ntpath]-->
    [ "see", "if", "this", "works" ]

Notes:

  • On Windows NT based platforms os.path is ntpath
  • On Unix/Posix based platforms os.path is posixpath
  • ntpath will not handle backslashes (\) correctly (see last two cases in code/output) - which is why posixpath is recommended.
  • remember to use urllib.parse.unquote
  • consider using posixpath.normpath
  • The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath.normpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same), except that exactly two leading slashes are preserved.
  • Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
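
One caveat about the sample code: the loop in path_parse only stops when the remaining path is exactly "/", so it does not terminate for an empty path or for a path such as "//hithere" (normpath keeps two leading slashes). A minimal sketch of a variant that stops as soon as split makes no progress; the name path_parse_safe is hypothetical:

```python
import posixpath
from urllib.parse import urlparse


def path_parse_safe(path_string, *, normalize=True, module=posixpath):
    """Variant of path_parse that stops when split() makes no progress."""
    tmp = module.normpath(path_string) if normalize else path_string
    result = []
    while True:
        head, item = module.split(tmp)
        if head == tmp:
            # Reached a root such as "/" or "//", or an empty string.
            break
        result.insert(0, item)
        tmp = head
    return result
```

For the inputs in the test cases above it is intended to produce the same lists as path_parse.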

Gresham answered 26/8, 2014 at 0:33 Comment(3)
Python 3.4+ solution: url_path = PurePosixPath(urllib.parse.unquote(urllib.parse.urlparse(url).path)). - Adapter
@Adapter Worthwhile to post this as an answer. - Parette
Great answer. However, this fails if there is an error in one of the scraped URLs. For example, test_url( "http://eg.com/hithere//something/else" ) will lead to an infinite loop on while tmp != "/": - Watterson
C
10

Note that in Python 3 the import has changed to from urllib.parse import urlparse (see the documentation). Here is an example:

>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
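
Applied to the two URLs from the question, a short sketch shows that the query string ends up in .query and never in .path, so both paths start with the same section:

```python
from urllib.parse import urlparse

urls = [
    'http://www.mydomain.com/hithere?image=2934',
    'http://www.mydomain.com/hithere/something/else',
]

# urlparse separates the query string from the path.
paths = [urlparse(u).path for u in urls]
queries = [urlparse(u).query for u in urls]

for path, query in zip(paths, queries):
    print(repr(path), repr(query))
```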
Camelopardus answered 8/12, 2018 at 15:31 Comment(0)
C
3
import urlparse

output = urlparse.urlparse('http://www.example.com/temp/something/happen/index.html').path

output

'/temp/something/happen/index.html'

Split the path with the string's built-in rpartition method:

output.rpartition('/')[0]

'/temp/something/happen'
Chorion answered 24/8, 2016 at 10:0 Comment(0)
J
2

Here is an example using urlparse and rpartition.

try:
    # Python 3.x
    from urllib.parse import urlparse
except ImportError:
    # Python 2.x
    from urlparse import urlparse

def printPathTokens(full_url):
    print('printPathTokens() called: %s' % full_url)

    p_full = urlparse(full_url).path

    print(' . p_full url: %s' % p_full)

    # Split the path using the rpartition method of str.
    # rpartition returns a 3-tuple: the part before the last separator,
    # the separator itself, and the part after it.
    (rp_left, rp_match, rp_right) = p_full.rpartition('/')

    if rp_match == '': # the separator slot is empty when no '/' was found
        print(' . No slashes found in path')
    else:
        print(' . path to last resource: %s' % rp_left)
        if rp_right == '': # Ended with a slash
            print(' . last resource: (none)')
        else:
            print(' . last resource: %s' % (rp_right))


printPathTokens('http://www.example.com/temp/something/happen/index.html')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen/index.html
# . p_full url: /temp/something/happen/index.html
# . path to last resource: /temp/something/happen
# . last resource: index.html

printPathTokens('http://www.example.com/temp/something/happen/')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen/
# . p_full url: /temp/something/happen/
# . path to last resource: /temp/something/happen
# . last resource: (none)

printPathTokens('http://www.example.com/temp/something/happen')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen
# . p_full url: /temp/something/happen
# . path to last resource: /temp/something
# . last resource: happen
Janitajanith answered 14/5, 2019 at 14:50 Comment(0)
S
0

A combination of urlparse and os.path.split will do the trick. The following script (Python 2) stores all sections of a URL's path in a list, in reverse order.

import os.path, urlparse

def generate_sections_of_url(url):
    path = urlparse.urlparse(url).path
    sections = []
    # Peel off the last section until only slashes (or nothing) remain;
    # this also terminates for an empty path.
    while path.strip('/'):
        path, section = os.path.split(path)
        sections.append(section)
    return sections

For http://www.mydomain.com/hithere/something/else this would return: ["else", "something", "hithere"]

Softwood answered 29/1, 2016 at 15:53 Comment(0)
