how to follow meta refreshes in Python
Python's urllib2 follows 3xx redirects to get the final content. Is there a way to make urllib2 (or some other library such as httplib2) also follow meta refreshes? Or do I need to parse the HTML manually for the refresh meta tags?

Pat answered 23/2, 2010 at 13:31 Comment(0)
OK, it seems no library supports this, so I have been using this code:

import urllib2
import urlparse
import re

def get_hops(url):
    redirect_re = re.compile('<meta[^>]*?url=(.*?)["\']', re.IGNORECASE)
    hops = []
    while url:
        if url in hops:
            # circular redirect; stop following
            url = None
        else:
            hops.insert(0, url)  # most recent hop first
            response = urllib2.urlopen(url)
            if response.geturl() != url:
                # urllib2 followed an HTTP redirect; record the final URL too
                hops.insert(0, response.geturl())
            # check for redirect meta tag
            match = redirect_re.search(response.read())
            if match:
                url = urlparse.urljoin(url, match.groups()[0].strip())
            else:
                url = None
    return hops
Pat answered 27/2, 2010 at 1:27 Comment(0)
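For reference, the regex-extraction step above can be exercised on its own. A hedged Python 3 sketch of just that step (urllib2/urlparse became urllib.request/urllib.parse in Python 3; the function name is my own):

```python
import re
from urllib.parse import urljoin

# same pattern as in get_hops above
redirect_re = re.compile(r'<meta[^>]*?url=(.*?)["\']', re.IGNORECASE)

def extract_meta_refresh(base_url, html_text):
    """Return the absolute target of a meta refresh, or None."""
    match = redirect_re.search(html_text)
    if match:
        # resolve relative targets against the page URL
        return urljoin(base_url, match.group(1).strip())
    return None
```
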
Here is a solution using BeautifulSoup and httplib2 (and certificate-based authentication):

import BeautifulSoup
import httplib2

def meta_redirect(content):
    soup = BeautifulSoup.BeautifulSoup(content)

    result = soup.find("meta", attrs={"http-equiv": "Refresh"})
    if result:
        wait, text = result["content"].split(";")
        if text.strip().lower().startswith("url="):
            url = text.strip()[4:]
            return url
    return None

def get_content(url, key, cert):
    h = httplib2.Http(".cache")
    h.add_certificate(key, cert, "")

    resp, content = h.request(url, "GET")

    # follow the chain of meta refreshes
    redirect_url = meta_redirect(content)
    while redirect_url:
        resp, content = h.request(redirect_url, "GET")
        redirect_url = meta_redirect(content)

    return content
Kerr answered 8/9, 2010 at 14:30 Comment(2)
'url=text[4:]' should be 'url=text.strip()[4:]' to remove leading spaces. Note also that I sometimes see REFRESH instead of Refresh.Germane
I agree. I fixed the code as you suggested.Kerr
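One caveat with the sketch above: it returns the url attribute verbatim, so a relative target (e.g. url=/login) would need to be resolved against the page URL first, for example with urljoin (urlparse.urljoin on Python 2, urllib.parse.urljoin on Python 3). The values here are hypothetical, just to illustrate the resolution step:

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

# hypothetical page URL and refresh target
base = "http://example.com/a/page.html"
target = "/login?next=1"
print(urljoin(base, target))  # http://example.com/login?next=1
```
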
A similar solution using the requests and lxml libraries. It also does a simple check that the content being tested is actually HTML (a requirement in my implementation), and it can capture and reuse cookies via the requests library's sessions (sometimes necessary when redirection plus cookies are used as an anti-scraping mechanism).

import magic
import mimetypes
import requests
from lxml import html 
from urlparse import urljoin

def test_for_meta_redirections(r):
    mime = magic.from_buffer(r.content, mime=True)
    extension = mimetypes.guess_extension(mime)
    if extension == '.html':
        html_tree = html.fromstring(r.text)
        # translate() makes the http-equiv comparison case-insensitive
        attrs = html_tree.xpath("//meta[translate(@http-equiv, 'REFSH', 'refsh') = 'refresh']/@content")
        if attrs:  # guard: pages without a meta refresh return no nodes
            wait, _, text = attrs[0].partition(";")
            text = text.strip()
            if text.lower().startswith("url="):
                url = text[4:]
                if not url.startswith('http'):
                    # Relative URL, adapt
                    url = urljoin(r.url, url)
                return True, url
    return False, None


def follow_redirections(r, s):
    """
    Recursive function that follows meta refresh redirections if they exist.
    """
    redirected, url = test_for_meta_redirections(r)
    if redirected:
        r = follow_redirections(s.get(url), s)
    return r

Usage:

s = requests.session()
r = s.get(url)
# test for and follow meta redirects
r = follow_redirections(r, s)
Turnedon answered 4/6, 2013 at 19:52 Comment(2)
Sometimes meta refresh redirects point to relative URLs. For instance, Facebook does <noscript><meta http-equiv="refresh" content="0; URL=/?_fb_noscript=1" /></noscript>. It would be good to detect relative URLs and prepend the scheme and host.Ewald
@JosephMornin: Adapted. I realized it still doesn't support circular redirects though...always something.Turnedon
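As the last comment notes, the recursive version has no protection against circular redirects. A library-agnostic sketch of such a loop guard; the fetch and extract callables are placeholders for whatever HTTP client and meta-tag parser you use:

```python
def follow_meta_refreshes(url, fetch, extract, max_hops=10):
    """Follow meta refreshes until none remain, a URL repeats,
    or max_hops is exhausted.

    fetch(url) -> page body; extract(body) -> next URL or None.
    Both are supplied by the caller.
    """
    seen = {url}
    body = fetch(url)
    for _ in range(max_hops):
        next_url = extract(body)
        if next_url is None or next_url in seen:
            break  # done, or a circular redirect
        seen.add(next_url)
        url = next_url
        body = fetch(url)
    return url, body
```
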
If you don't want to use bs4, you can use lxml like this:

from lxml.html import soupparser

def meta_redirect(content):
    root = soupparser.fromstring(content)
    result_url = root.xpath('//meta[@http-equiv="refresh"]/@content')
    if not result_url:
        return None
    content_attr = str(result_url[0])
    # the content attribute may use either "url=" or "URL="
    urls = content_attr.split('url=') if len(content_attr.split('url=')) >= 2 else content_attr.split('URL=')
    return urls[1] if len(urls) >= 2 else None
Micron answered 13/2, 2017 at 7:42 Comment(0)
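The two-branch split on 'URL=' versus 'url=' above can also be collapsed into a single case-insensitive split with the standard library's re module; a minimal sketch (the function name is my own):

```python
import re

def split_refresh_target(content_attr):
    # split the content attribute on "url=" regardless of case;
    # mirrors the URL=/url= branching above
    parts = re.split(r'url=', content_attr, maxsplit=1, flags=re.IGNORECASE)
    return parts[1].strip() if len(parts) == 2 else None
```
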
I wanted to provide an updated version of the code written here, adapted for Python 3.10+: urlparse does not exist in Python 3 (it was replaced by urllib.parse), and I have also dropped httplib2/urllib2 in favor of requests. Please understand that I merged and modified code from the posts already here; this is a collaborative update, not my own work.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def meta_redirect(response):
    # don't lowercase the whole document here: that would corrupt
    # case-sensitive redirect targets
    soup = BeautifulSoup(response.text, features='html.parser')

    # match http-equiv="refresh" case-insensitively
    result = soup.find("meta", attrs={"http-equiv": lambda v: v and v.lower() == "refresh"})
    if result:
        wait, _, text = result["content"].partition(";")
        if text.strip().lower().startswith("url="):
            url = text.strip()[4:].strip("'\"")
            if not url.startswith('http'):
                url = urljoin(response.url, url)
            return url
    return None

def get_content(url):
    response = requests.get(url, verify=False)  # note: disables TLS verification

    # follow the chain of meta refreshes
    redirect_url = meta_redirect(response)
    while redirect_url:
        response = requests.get(redirect_url, verify=False)
        redirect_url = meta_redirect(response)

    return response

def main():
    url = 'put your url here that has meta redirects'

    source = get_content(url)

    # This will print the source of the final page
    print(source.text)

if __name__ == '__main__':
    main()
Gosser answered 5/2 at 20:34 Comment(0)
Use BeautifulSoup or lxml to parse the HTML.

Agonize answered 23/2, 2010 at 13:39 Comment(2)
Using an HTML parser just to extract the meta refresh tag is overkill, at least for my purposes. I was hoping there was a Python HTTP library that did this automatically.Pat
Well, meta is an HTML tag, so it is unlikely that you will find this functionality in an HTTP library.Privileged
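As the comments say, this lives at the HTML layer rather than the HTTP layer. If pulling in BeautifulSoup or lxml feels heavy, the standard library's html.parser is enough to extract the target; a Python 3 sketch (class and function names are my own):

```python
from html.parser import HTMLParser

class MetaRefreshParser(HTMLParser):
    """Record the target of the first meta refresh tag, if any."""
    def __init__(self):
        super().__init__()
        self.url = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.url is not None:
            return
        d = dict(attrs)  # attribute names arrive lowercased
        if d.get('http-equiv', '').lower() != 'refresh':
            return
        # content looks like "5; url=http://example.com/"
        _, _, rest = d.get('content', '').partition(';')
        rest = rest.strip()
        if rest.lower().startswith('url='):
            self.url = rest[4:].strip('\'"')

def find_meta_refresh(html_text):
    parser = MetaRefreshParser()
    parser.feed(html_text)
    return parser.url
```
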

© 2022 - 2024 — McMap. All rights reserved.