Which should I be using: urlparse or urlsplit?

Which URL parsing function pair should I be using and why?

Kitsch answered 29/3, 2011 at 12:2 Comment(0)

Directly from the docs you linked yourself:

urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
This is similar to urlparse(), but does not split the params from the URL. This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted.

Coleridge answered 29/3, 2011 at 12:8 Comment(3)
Since those URLs (with parameters attached to any path element) are rarely used in practice, perhaps it would be worth adding an example showing the differences in the parsed results, e.g. like here: doughellmann.com/PyMOTW/urlparse/#parsing - Hallah
Updated Python 3 link for those interested - Cowie
Could you provide example URLs illustrating the difference? I've read the Python docs and briefly looked at RFC 2396, but it is unclear which type of URL parameters they are referring to, other than the fact that they use a semicolon. - Breadthways

Since the documentation you linked doesn't include an example with nonempty params, I was also confused until I found this:

>>> urllib.parse.urlparse("http://example.com/pa/th;param1=foo;param2=bar?name=val#frag")
ParseResult(scheme='http', netloc='example.com', path='/pa/th', params='param1=foo;param2=bar', query='name=val', fragment='frag')
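
For comparison, here is a small sketch that runs the same URL through both functions (Python 3, `urllib.parse`). The variable names `parsed` and `split` are just illustrative; the point is that `urlsplit` leaves the `;params` part inside `path` and its result has no `params` field at all.

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/pa/th;param1=foo;param2=bar?name=val#frag"

parsed = urlparse(url)  # 6 components, including params
split = urlsplit(url)   # 5 components, no params

# urlparse strips the params off the end of the path...
print(parsed.path)    # /pa/th
print(parsed.params)  # param1=foo;param2=bar

# ...while urlsplit keeps them as part of the path.
print(split.path)     # /pa/th;param1=foo;param2=bar

# Everything else (scheme, netloc, query, fragment) is identical.
print(parsed.query == split.query)  # True
```

So if you never look at `params`, the two results differ only when the last path segment contains a semicolon.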

(Some history because I got nerd-sniped.)

I'd never heard of URL "parameters" other than path parameters (e.g. /user/213/settings) or query parameters (e.g. /user?id=213), and I think this third kind is essentially obsolete.

In the beginning, RFC 1738 defined the HTTP URL to never allow ; in the path:

http://<host>:<port>/<path>?<searchpart>

Within the <path> and <searchpart> components, "/", ";", "?" are reserved.

; was reserved with special meaning in other schemes, like the ftp:// url-path:

<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>

Apparently in 1995, RFC 1808 defined URL params as a top-level component between path and query:

<scheme>://<net_loc>/<path>;<params>?<query>#<fragment>

Then in 1998, RFC 2396 defined URIs as having adjacent top-level components path and query:

<scheme>://<authority><path>?<query>

where the path is defined as multiple path_segments, each of which can include a param:

path          = [ abs_path | opaque_part ]
abs_path      = "/"  path_segments
path_segments = segment *( "/" segment )
segment       = *pchar *( ";" param )

Finally in 2005, RFC 3986 obsoleted RFC 1808 and 2396, defining URI similarly to RFC 2396:

URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 

hier-part   = "//" authority path-abempty
            / path-absolute
            / path-rootless
            / path-empty

And the special syntax of ;params is considered an opaque part of the URI syntax that may be specific to the HTTP(S) scheme or just some specific implementation:

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.
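
To make the per-segment case concrete, here is a rough sketch using the `name;v=1.1` segment from the RFC text. `urlparse` only splits params off the last path segment, so params on an earlier segment stay in `path` either way. The `segment_params` helper is hypothetical (it is not part of `urllib.parse`) and assumes an absolute path and `;` as the only delimiter.

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/name;v=1.1/child"

parsed = urlparse(url)
# The semicolon is not in the last segment, so nothing is split off.
print(parsed.path)          # /name;v=1.1/child
print(repr(parsed.params))  # ''


def segment_params(path):
    """Hypothetical helper: split each segment of an absolute path
    into (name, [params]) on ';'."""
    result = []
    for segment in path.split("/")[1:]:
        name, *params = segment.split(";")
        result.append((name, params))
    return result


segments = segment_params(urlsplit(url).path)
print(segments)  # [('name', ['v=1.1']), ('child', [])]
```

If you do need per-segment parameters, starting from `urlsplit` and doing your own splitting like this avoids `urlparse` special-casing only the final segment.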

Pecuniary answered 6/8, 2020 at 18:36 Comment(0)

As the documentation says:
urlparse.urlparse returns a 6-tuple (with an additional params element)
urlparse.urlsplit returns a 5-tuple

Attribute | Index | Value                            | Value if not present
params    | 3     | Parameters for last path element | empty string
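
A quick way to check the tuple sizes yourself, sketched for Python 3, where the `urlparse` module became `urllib.parse`. The URL is a made-up example:

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/path;type=a?q=1#top"

six = urlparse(url)   # (scheme, netloc, path, params, query, fragment)
five = urlsplit(url)  # (scheme, netloc, path, query, fragment)

print(len(six), len(five))  # 6 5

# Index 3 is params in the 6-tuple, but query in the 5-tuple.
print(six[3])   # type=a
print(five[3])  # q=1
```

Because the indexes shift, accessing fields by name (`.query`, `.fragment`) is safer than by position if you might swap one function for the other.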


FYI: According to [RFC2396](https://www.rfc-editor.org/rfc/rfc2396.html#appendix-C), regarding the _parameter_ component in the URL specification:

> Extensive testing of current client applications demonstrated that the majority of deployed systems do not use the ";" character to indicate trailing parameter information, and that the presence of a semicolon in a path segment does not affect the relative parsing of that segment. Therefore, parameters have been removed as a separate component and may now appear in any path segment. Their influence has been removed from the algorithm for resolving a relative URI reference.
Burkes answered 1/5, 2015 at 2:57 Comment(2)
From your answer it is not clear which method you advise using. - Accumulate
It depends: if you need parameters in the URL, then use urlsplit. - Burkes
