How to retrieve a webpage in Python, including any images

Asked 5/9, 2011 at 20:58 Answered 5/9, 2011 at 23:53

I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:

import urllib

page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print(open('urlgot.php').read())  # show the saved source

which retrieves the source fine, but I also need to download any linked images.

I was thinking I could write a regular expression that searches the downloaded source for img src or similar; however, I was wondering whether there is a urllib function that would retrieve the images as well, similar to this wget command:

wget -r --no-parent http://127.0.0.1/myurl.php

I don't want to use the os module to run wget, as I want the script to run on all systems. For the same reason, I can't use any third-party modules either.

Any help is much appreciated! Thanks

Hyssop answered 5/9, 2011 at 20:58 Comment(1)

Good luck. You should also ask how to package Python packages, and use your system's package manager. – Eph 6/9, 2011 at 0:10

Don't use regex when there is a perfectly good parser built in to Python:

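import os
from urllib.parse import urljoin, urlparse  # Py2: from urlparse
from urllib.request import urlretrieve      # Py2: from urllib
from html.parser import HTMLParser          # Py2: from HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.downloads.append(value)

parser = ImgParser()
with open('test.html') as f:
    # instead of reading a saved file, you could feed the parser
    # the decoded response from urlopen() directly
    parser.feed(f.read())

for path in parser.downloads:
    # urljoin copes with absolute URLs and leading slashes in src
    url = urljoin(base_url, path)
    print(url)
    # save under the bare file name so no local directories are needed
    urlretrieve(url, os.path.basename(urlparse(url).path))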
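
If you'd rather skip the saved file and feed the parser straight from the URL, something along these lines might do. It is only a sketch: the names ImgSrcCollector, image_urls and fetch_image_urls are made up for illustration, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: records the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(html, page_url):
    """Return absolute image URLs found in html, resolved against page_url."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return [urljoin(page_url, src) for src in collector.sources]


def fetch_image_urls(page_url):
    """Download page_url and return the absolute URLs of its images."""
    with urlopen(page_url) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return image_urls(html, page_url)
```

Calling fetch_image_urls('http://127.0.0.1/myurl.php') should then give a list of absolute URLs that can be passed to urlretrieve one by one. Resolving against the page URL rather than a fixed base means relative paths in subdirectories should still work.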

Porthole answered 5/9, 2011 at 23:53 Comment(1)

sorry for the late response, but this worked perfectly! thanks a lot =) – Hyssop 11/9, 2011 at 22:18

Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.
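
Since BeautifulSoup isn't in the standard library, here is a rough sketch of the frame and iframe recursion using html.parser instead. The names ResourceParser, fetch_text and collect_images are hypothetical, the max_depth default of 3 is an arbitrary choice, and the UTF-8 fallback is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceParser(HTMLParser):
    """Hypothetical helper: separates image sources from frame sources."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.frames = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if not src:
            return
        if tag == "img":
            self.images.append(src)
        elif tag in ("frame", "iframe"):
            self.frames.append(src)


def fetch_text(url):
    """Default fetcher; assumes UTF-8 when no charset is declared."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def collect_images(url, fetch=fetch_text, seen=None, max_depth=3):
    """Return absolute image URLs on url and on any pages it frames."""
    seen = set() if seen is None else seen
    # Stop on pages already visited (frame loops) or when too deep.
    if url in seen or max_depth < 0:
        return []
    seen.add(url)

    parser = ResourceParser()
    parser.feed(fetch(url))
    parser.close()

    found = [urljoin(url, src) for src in parser.images]
    for frame_src in parser.frames:
        found += collect_images(urljoin(url, frame_src), fetch, seen, max_depth - 1)
    return found
```

The seen set is there so that two pages framing each other don't recurse forever, and the fetch parameter lets you swap in something other than urlopen if needed.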

Thales answered 5/9, 2011 at 21:15 Comment(5)

Forgive my ignorance, but wouldn't that mean it couldn't run on a computer that doesn't have the BeautifulSoup module installed? – Hyssop 5/9, 2011 at 21:28

You need to distribute the BeautifulSoup library with your application. That should not be very difficult, unless you are dealing with native extensions, which on Windows tend to have .exe installers. – Alsatian 5/9, 2011 at 21:49

thanks but that's not really what i'm looking for =( - i'll just use a regex to parse for img tags. cheers – Hyssop 5/9, 2011 at 22:37

@Jingo: That's fine, but be sure to deal properly with varying order of img attributes and multi-line img elements. You may also want to avoid img elements inside comments and strings. – Thales 5/9, 2011 at 22:42

@Jingo: be warned. HTML is not a regular language. – Discompose 6/9, 2011 at 0:2
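
To make the pitfalls mentioned in these comments concrete, here is a small comparison. The sample markup and the NAIVE pattern are invented for illustration (a guess at what a first-attempt regex might look like), and SrcCollector and parsed_sources are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Invented sample: a commented-out img, an img with alt before src,
# and an img split over several lines with single quotes.
SAMPLE = """
<!-- <img src="old.png"> -->
<img alt="logo" src="logo.png">
<img
    src='photo.jpg'
    alt="photo">
"""

# A naive first-attempt pattern: src must directly follow "<img ".
NAIVE = re.compile(r'<img src="([^"]+)"')


class SrcCollector(HTMLParser):
    """Collects img src values; comments never reach handle_starttag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def parsed_sources(html):
    """Return the img src values the standard-library parser finds."""
    collector = SrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


print(NAIVE.findall(SAMPLE))   # ['old.png']
print(parsed_sources(SAMPLE))  # ['logo.png', 'photo.jpg']
```

The regex picks up only the commented-out image and misses both real ones, while the parser does the opposite. A more careful pattern could cover some of these cases, but each fix tends to uncover another one.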
