URL parsing in Python - normalizing double-slash in paths
I am working on an app which needs to parse URLs (mostly HTTP URLs) in HTML pages - I have no control over the input and some of it is, as expected, a bit messy.

One problem I'm encountering frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double-slashes in the path part, for example:

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, 
                 urlparse.urlparse(testUrl).path)

Instead of the expected result http://www.example.com//path (or even better, with a normalized single slash), I end up with http://path.

BTW the reason I'm running such code is because it's the only way I found so far to strip the query / fragment part off of URLs. Maybe there is a better way to do it, but I couldn't find one.

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?

Waler answered 19/1, 2012 at 12:21 Comment(3)
What do you mean by "it's the only way to strip the query / fragment part"? What does the slash have to do with the query?Jeraldjeraldine
It has nothing to do with the query - the reason I'm parsing a URL and then joining its own path back into it is because I want to strip out the query and fragment. If there was a better way to do it, I wouldn't need to solve this problemWaler
I think urlparse is just implementing the URL RFC correctly - it seems to specify only one slash after the <hostname>:<port> part (tools.ietf.org/html/rfc1738) - so in your case I would try to strip the extra slash before passing the URL to urlparse.Socialize
F
5

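If you only want to get the URL without the query part, I would skip the urlparse module and just do:

testUrl.split('?', 1)

The URL will be at index 0 of the returned list and the query (if there is one) at index 1.

Only the first '?' separates the path from the query; any later '?' is part of the query or fragment, which is why the split is limited to one. Note that a URL with a fragment but no query keeps its '#...' part.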
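
For stripping both the query and the fragment with the standard library, a minimal sketch might look like the following. It assumes Python 3, where the urlparse module lives in urllib.parse; strip_query_and_fragment is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep scheme, netloc and path; blank out query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_query_and_fragment("http://www.example.com//path?foo=bar#top"))
# http://www.example.com//path
```

The double slash in the path is left untouched here, since nothing is joined back onto the base URL.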

Fruition answered 19/1, 2012 at 12:40 Comment(1)
This does not answer any urlparse issues, but it definitely solves my use case in a very simple way. Thanks!Waler
A
5

The path (//path) on its own is not a valid path reference: a string starting with // is read as a network-path reference, so urljoin interprets "path" as a hostname.

https://www.rfc-editor.org/rfc/rfc3986.html#section-3.3

If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
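
The effect can be reproduced in isolation. The sketch below assumes Python 3 (urllib.parse instead of urlparse) and reuses the URL from the question:

```python
from urllib.parse import urljoin, urlsplit

base = "http://www.example.com//path?foo=bar"

# A reference starting with "//" is parsed as a network-path reference:
# "path" lands in netloc and the path component is empty.
ref = urlsplit("//path")
print(repr(ref.netloc), repr(ref.path))  # 'path' ''

# urljoin therefore keeps only the scheme of the base URL.
print(urljoin(base, "//path"))  # http://path
```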

I don't particularly like either of these solutions, but they work:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'

parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)

print cleaned
# http://www.example.com/path?foo=bar

print urlparse.urljoin(
    testurl, 
    urlparse.urlparse(cleaned).path)

# http://www.example.com/path

Depending on what you are doing, you could do the joining manually:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))

# Keep scheme, netloc and path from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
# and leave params, query and fragment blank
newurl = parsed[:3] + ['', '', '']

print urlparse.urlunparse(newurl)
# http://www.example.com//path
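
For readers on Python 3, a rough equivalent of the first approach (collapse repeated slashes in the path component only) could be sketched as follows. normalize_path_slashes is a hypothetical name, and urllib.parse stands in for the Python 2 urlparse module:

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_path_slashes(url):
    """Collapse repeated slashes in the path only (hypothetical helper)."""
    parts = urlsplit(url)
    # Only the path is rewritten; scheme, netloc, query and fragment are kept.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=path))


print(normalize_path_slashes("http://www.example.com//path?foo=bar"))
# http://www.example.com/path?foo=bar
```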
Anagnos answered 19/1, 2012 at 12:59 Comment(1)
The URL is in fact valid, because it does contain an authority section - so the path may begin with '//'. In any case, even if it were not valid, being able to parse invalid but "real world" URLs would be helpful.Waler
I
2

The official urlparse docs mention that:

If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example:

>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'

If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.

So you can do:

urlparse.urljoin(testUrl,
             urlparse.urlparse(testUrl).path.replace('//','/'))

Output = 'http://www.example.com/path'
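
One thing to keep in mind with str.replace is that it makes a single pass, so runs of three or more slashes are only shortened. A small sketch (Python 3, with a made-up path value) comparing it to a regex:

```python
import re

path = "/a///b////c"  # made-up example path

# str.replace makes one left-to-right pass over non-overlapping matches,
# so longer runs of slashes are shortened but not fully collapsed.
print(path.replace("//", "/"))        # /a//b//c

# A regex matching "two or more slashes" collapses each run in one go.
print(re.sub(r"/{2,}", "/", path))    # /a/b/c
```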

Inquiline answered 19/1, 2012 at 12:37 Comment(0)
G
2

Try this:

def http_normalize_slashes(url):
    url = str(url)
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url

Example URLs:

print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))

Will return:

http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar

Hope it helps.. :)

Gink answered 24/10, 2015 at 18:57 Comment(0)
V
1

Using furl, try:

import furl

f = furl.furl('http://www.example.com//path?foo=bar')
f.set(path=f.path.normalize())
Vanhomrigh answered 2/7, 2022 at 8:57 Comment(0)
H
0

Couldn't this be a solution?

urlparse.urlparse(testUrl).path.replace('//', '/')
Hypoplasia answered 19/1, 2012 at 12:54 Comment(0)
B
0

This answer seemed to give the best results in the cases I tried for correcting double slashes in paths, without touching the initial double slash in the http:// bit.

Here's the code:

from urlparse import urljoin
from functools import reduce


def slash_join(*args):
    return reduce(urljoin, args).rstrip("/")
Burnedout answered 29/6, 2018 at 13:23 Comment(0)
O
0

I've adapted @yunhasnawa's answer to my needs. Here is the relevant part:

import urllib2
from urlparse import urlparse, urlunparse

def sanitize_url(url):
    url_parsed = urlparse(url)
    return urlunparse((url_parsed.scheme, url_parsed.netloc, avoid_double_slash(url_parsed.path), '', '', ''))

def avoid_double_slash(path):
    parts = path.split('/')
    not_empties = [part for part in parts if part]
    return '/'.join(not_empties)


>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'
Outdare answered 5/12, 2018 at 16:53 Comment(0)
Q
0

This might not be totally safe, but you could use this regex:

import re


def sanitize_url(url: str) -> str:
    return re.sub(r"([^:]/)(/)+", r"\1", url)

It replaces "[not colon] followed by two or more slashes" with "[not colon] followed by a single slash". The [not colon] is there to preserve the http:// or https://.
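
To illustrate the "not totally safe" part: because the regex runs over the whole string, it also rewrites double slashes that appear in the query. A short sketch using the function above (the example URLs are made up):

```python
import re


def sanitize_url(url: str) -> str:
    # Same regex as above, applied to the whole URL string.
    return re.sub(r"([^:]/)(/)+", r"\1", url)


# Double slash in the path: collapsed as intended.
print(sanitize_url("http://www.example.com//path"))
# http://www.example.com/path

# Double slash inside the query string: collapsed as well.
print(sanitize_url("http://www.example.com/go?next=a//b"))
# http://www.example.com/go?next=a/b
```

If query strings must be preserved byte for byte, splitting the URL first and rewriting only the path component avoids this.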

Quaquaversal answered 18/12, 2019 at 19:53 Comment(0)
