How can I prepend the 'http://' protocol to a url when necessary? [duplicate]
A

6

23

I need to parse a URL. I'm currently using urlparse.urlparse() and urlparse.urlsplit().

The problem is that I can't get the "netloc" (host) from the URL when the scheme is not present. For example, if I have the following URL:

www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1

I can't get the netloc: www.amazon.com

According to the Python docs:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

So it's this way on purpose. But I still don't know how to get the netloc from that URL.

I think I could check whether the scheme is present, add it if it's not, and then parse the result. But this solution doesn't seem very good.

Do you have a better idea?

EDIT: Thanks for all the answers. But I can't use the "startswith" check proposed by Corey and others, because a URL with another protocol/scheme would get mangled. For example:

If I get this URL:

ftp://something.com

With the proposed code I would add "http://" to the start and break it.

The solution I found:

if not urlparse.urlparse(url).scheme:
    url = "http://" + url
return urlparse.urlparse(url)

Something to note:

I do some validation first, and if no scheme is given I assume http://.
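
For readers on Python 3, where the urlparse module became urllib.parse, the approach above might be sketched as follows. The function name parse_with_default_scheme and its default_scheme parameter are illustrative assumptions, not part of the original post.

```python
from urllib.parse import urlparse


def parse_with_default_scheme(url, default_scheme="http"):
    """Parse url, assuming default_scheme when no scheme is present.

    Hypothetical helper: the name and the default value are assumptions.
    """
    if not urlparse(url).scheme:
        # No scheme was recognized, so the host would end up in .path;
        # prepend the default so that urlparse can see a netloc.
        url = f"{default_scheme}://{url}"
    return urlparse(url)


print(parse_with_default_scheme("www.amazon.com/Programming-Python-Mark-Lutz").netloc)
print(parse_with_default_scheme("ftp://something.com").scheme)
```

One caveat: a scheme-less input that carries a port, such as host:8080/path, may be read by urlparse as having a scheme, so inputs like that would still need the validation step mentioned above.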

Adlai answered 14/6, 2011 at 14:18 Comment(5)
Is this because the protocol portion of the URL (the http://) is missing? - Backlash
Yes, that's the reason. But how can I get the netloc if the scheme is missing? - Adlai
In your solution, I'd still check for the leading // (and possibly just /), since a proper URL would have that (even when the scheme is missing). - Nonstriated
@TokenMacGuy I do that; it's in the "validation" part. Good point. See Steve's answer. - Adlai
Now, if you had provided a self-answer with your solution, you might get some upvotes for that, too. (Or do you want someone else to post your answer, or something else entirely?) - Inside
P
6

The documentation has this exact example, just below the text you pasted. Adding '//' if it's not there will get what you want. If you don't know whether it'll have the protocol and '//' you can use a regex (or even just see if it already contains '//') to determine whether or not you need to add it.

Your other option would be to use split('/') and take the first element of the list it returns, which will ONLY work when the url has no protocol or '//'.

EDIT (adding for future readers): a regex for detecting the protocol would be something like re.match('(?:http|ftp|https)://', url)
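
As a rough sketch of that regex check, the pattern below accepts any syntactically plausible scheme rather than a fixed list. The names SCHEME_RE and has_scheme are made up for illustration, and the pattern is an assumption about what counts as a scheme (a letter followed by letters, digits, "+", "-" or "."), written for Python 3.

```python
import re

# A letter, then letters, digits, "+", "-" or ".", followed by "://".
# This is broader than the (?:http|ftp|https) list above.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")


def has_scheme(url):
    """Return True if url starts with something shaped like 'scheme://'."""
    # re.match only matches at the start of the string, so a "://"
    # that appears later (for example in a query string) is ignored.
    return SCHEME_RE.match(url) is not None


for candidate in ("http://www.amazon.com", "ftp://something.com", "www.amazon.com/dp/0596158106"):
    print(candidate, has_scheme(candidate))
```

Whether to prefer the fixed list or the generic pattern depends on how strict the validation needs to be.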

Primeval answered 14/6, 2011 at 14:27 Comment(5)
I still have the different-protocols problem (see my comment on Bryan's answer). Thanks. - Adlai
Then you can use a regex - check for (?:http|ftp|etc):// - or just check for the existence of '://' in the string. It depends how robust you want it to be; full URL parsing is complex. - Primeval
+1 You're right, SteveMc. Which would be faster: matching against the protocol list you posted, or calling urlparse as I proposed? - Adlai
urlparse likely (though I haven't looked) uses a regex to do the parsing (because, as I said, it's complicated), but the way you've done it seems very reasonable, so I would leave it as is. You can profile it if you're curious. - Primeval
Thanks for your answer, Steve. I did something similar to this. The regex in your comment is very good. You should add it to the answer for future readers. - Adlai
A
14

It looks like you need to specify the protocol to get the netloc.

Adding it if it's not present might look like this:

import urlparse

url = 'www.amazon.com/Programming-Python-Mark-Lutz'
if '//' not in url:
    url = '%s%s' % ('http://', url)
p = urlparse.urlparse(url)
print p.netloc

More about the issue: https://bugs.python.org/issue754016

Allusive answered 14/6, 2011 at 15:8 Comment(0)
F
6

If the protocol is always http you can use only one line:

return "http://" + url.split("://")[-1]

A better option is to keep the protocol if one was passed:

return url if "://" in url else "http://" + url
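
One thing to watch with a plain "://" in url test is that "://" can also appear later in a URL, for instance inside a query string. A possible variant, sketched here for Python 3 with a hypothetical ensure_scheme helper and made-up example URLs, only trusts a "://" that comes before any "/", "?" or "#":

```python
def ensure_scheme(url, default="http"):
    """Return url unchanged if it starts with a scheme, else prepend default.

    Hypothetical helper: the name and the default scheme are assumptions.
    """
    head, sep, _ = url.partition("://")
    # A genuine scheme sits before any path, query or fragment delimiter.
    if sep and head and not any(c in head for c in "/?#"):
        return url
    return f"{default}://{url}"


print(ensure_scheme("ftp://something.com"))
print(ensure_scheme("www.amazon.com/dp/0596158106"))
print(ensure_scheme("www.amazon.com/?next=http://something.com"))
```

The third call is the case the simple membership test would miss: the only "://" is inside the query string, so the default scheme is still added.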
Fencesitter answered 20/3, 2014 at 11:19 Comment(2)
Do you mean return url if "://" in url else "http://" + url? - Piping
Thanks Robert Dodd for the bug report. - Thicket
K
5

From the docs:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

So you can just do:

In [1]: from urlparse import urlparse

In [2]: def get_netloc(u):
   ...:     if not u.startswith('http'):
   ...:         u = '//' + u
   ...:     return urlparse(u).netloc
   ...: 

In [3]: get_netloc('www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[3]: 'www.amazon.com'

In [4]: get_netloc('http://www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[4]: 'www.amazon.com'

In [5]: get_netloc('https://www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[5]: 'www.amazon.com'
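
Note that the startswith('http') test treats any other scheme, such as ftp://, as scheme-less, and also matches hosts whose names merely begin with "http". A variant that avoids both, sketched for Python 3 (urllib.parse instead of urlparse) under the assumption that an input without a netloc is a bare host plus path, could look like this:

```python
from urllib.parse import urlsplit


def get_netloc(url):
    """Return the netloc of url, whether or not a scheme is present."""
    parts = urlsplit(url)
    if parts.netloc:
        # The URL already had 'scheme://' or a leading '//'.
        return parts.netloc
    # Assumption: with no netloc found, the input is 'host/path...',
    # so introduce it with '//' as the documentation describes.
    return urlsplit("//" + url).netloc


print(get_netloc("www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106"))
print(get_netloc("ftp://something.com"))
```

Because the decision is based on what the parser found rather than on a string prefix, no list of known schemes is needed.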
Kt answered 14/6, 2011 at 15:13 Comment(0)
T
2

Have you considered just checking for the presence of "http://" at the start of the URL, and adding it if it's not there? Another solution, assuming that the first part really is the netloc and not part of a relative URL, is to just grab everything up to the first "/" and use that as the netloc.
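
The second suggestion can be sketched in a couple of lines. The helper name netloc_by_split is made up for illustration, and the approach only makes sense when the input is known to have no scheme:

```python
def netloc_by_split(url):
    """Return everything before the first "/" (hypothetical helper)."""
    return url.split("/", 1)[0]


print(netloc_by_split("www.amazon.com/Programming-Python-Mark-Lutz"))  # www.amazon.com
print(netloc_by_split("ftp://something.com"))  # ftp: (the scheme, not the host)
```

The second call shows why this breaks as soon as a scheme is present.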

Tini answered 14/6, 2011 at 14:27 Comment(2)
Yes, that's what I'm doing right now, but I don't like it much. I'll stick with it if nothing better comes up. Thanks! - Adlai
I still have one more problem: what if another protocol/scheme is used? If I check for http:// in a URL like "ftp:// my.home.com", I would conclude that it's not present. If I then add it, I would break the URL. - Adlai
C
0

This one-liner would do it:

netloc = urlparse('//' + ''.join(urlparse(url)[1:])).netloc
Clementineclementis answered 5/4, 2013 at 23:52 Comment(0)
