Getting the final redirected URL

My code is as follows:

import urllib.request

url_orig = 'http://www.has-sante.fr/portail/jcms/c_676945/fr/prialt-ct-5245'
u = urllib.request.urlopen(url_orig)
print(u.geturl())

The URL gets redirected twice, so the output should be the final URL:

http://www.has-sante.fr/portail/upload/docs/application/pdf/2008-07/ct-5245_prialt_.pdf

But the output that I'm getting is the first redirect:

http://www.has-sante.fr/portail/plugins/ModuleXitiKLEE/types/FileDocument/doXiti.jsp?id=c_676945

How do I get the required final URL? Any help would be appreciated!

Whitewash answered 21/6, 2014 at 7:4 Comment(2)
Loop until the response isn't a redirect? - Glower
How do I do that? Sorry, I'm new to this. - Whitewash

This might be overkill for what you want, but it is an alternative to using regular expressions. This answer uses the Selenium browser automation Python API to follow the redirects. It will also open the PDF file in a browser window. The code below assumes you are using Firefox, but you can use another browser by swapping in its driver, e.g. webdriver.Chrome() or webdriver.Ie().

To install selenium: pip install selenium

The code:

from selenium import webdriver

driver = webdriver.Firefox()
link = 'http://www.has-sante.fr/portail/jcms/c_676945/fr/prialt-ct-5245'

driver.get(link)
print(driver.current_url)

It is also possible to run the browser in the background (headless) so no window pops up. An added benefit of this solution is that if the site changes the way the redirection works, you will not need to update any regular expressions in your code.
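
As a rough sketch of the background mode, assuming a recent Selenium release with Firefox and geckodriver available: the function name final_url_headless and the timeout parameter are made up for illustration, and the "-headless" flag is the Firefox command-line switch passed through Selenium's Firefox Options.

```python
def final_url_headless(link, timeout=30):
    """Return the URL Firefox ends up on after loading `link`, with no visible window.

    Hypothetical helper for illustration; not part of Selenium itself.
    """
    # Imported lazily so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # ask Firefox to run without a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.set_page_load_timeout(timeout)  # give up if the page hangs
        driver.get(link)
        return driver.current_url
    finally:
        driver.quit()  # always shut the browser process down
```

Calling final_url_headless('http://www.has-sante.fr/portail/jcms/c_676945/fr/prialt-ct-5245') should then behave like the snippet above, just without the pop-up window. Whether the final PDF URL is reported may depend on how the browser handles the download.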

Snob answered 21/6, 2014 at 9:50 Comment(1)
Yes, I had to use this method instead of urllib. Thanks a lot! - Whitewash

This will work. They are redirecting with JavaScript or an HTML tag rather than an HTTP redirect, so looking for the "Location" header won't work. This is not an elegant solution, but it works.

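import urllib.request

url = 'http://www.has-sante.fr/portail/jcms/c_676945/fr/prialt-ct-5245'

# str() on the bytes body gives its repr, so single quotes in the page show up as \'
req = str(urllib.request.urlopen(url).read())
# Take the text that follows URL=\' up to the closing quote sequence.
# Note: strip("../") removes leading/trailing '.' and '/' characters, not the literal prefix.
url = req.split("URL=\\'")[1].split("\\'\">'")[0].strip("../")

print("http://www.has-sante.fr/portail/" + url)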
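
A slightly less brittle variant of the same idea, as a sketch only: it assumes the page redirects through a <meta http-equiv="refresh"> tag (it will not catch JavaScript redirects) and uses only the standard library. The names MetaRefreshParser, meta_refresh_target, follow_meta_refresh and max_hops are made up for this example.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like: 0; URL='../some/path'
        _, _, rest = attributes.get("content", "").partition(";")
        key, _, value = rest.partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def meta_refresh_target(html, base_url):
    """Return the absolute meta-refresh target in `html`, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    return urljoin(base_url, parser.target)  # resolves relative paths like ../x


def follow_meta_refresh(url, max_hops=5):
    """Loop until a page no longer redirects (HTTP or meta refresh), then return its URL."""
    for _ in range(max_hops):
        with urllib.request.urlopen(url) as response:
            url = response.geturl()  # urlopen already followed any HTTP redirects
            if response.headers.get_content_type() != "text/html":
                return url  # e.g. the PDF itself
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
        target = meta_refresh_target(html, url)
        if target is None:
            return url
        url = target
    return url
```

With this, print(follow_meta_refresh(url)) would replace the split-based lines above. Resolving the target with urljoin avoids hard-coding the "http://www.has-sante.fr/portail/" prefix.
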
Eldaelden answered 21/6, 2014 at 9:22 Comment(0)
