Here's a little example to get you started with using BeautifulSoup for this kind of exercise - you give this script a URL, and it will print out the URLs of the images referenced from that page, taken from the `src` attribute of every `img` tag whose `src` ends with `jpg` or `png`:
```
import sys, urllib, re, urlparse
from BeautifulSoup import BeautifulSoup

if not len(sys.argv) == 2:
    print >> sys.stderr, "Usage: %s <URL>" % (sys.argv[0],)
    sys.exit(1)

url = sys.argv[1]
f = urllib.urlopen(url)
soup = BeautifulSoup(f)

# Find every <img> whose src attribute ends in "jpg" or "png" (case-insensitive)
for i in soup.findAll('img', attrs={'src': re.compile('(?i)(jpg|png)$')}):
    # Resolve relative src values against the page's URL
    full_url = urlparse.urljoin(url, i['src'])
    print "image URL:", full_url
```
Then you can use `urllib.urlretrieve` to download each of the images pointed to by `full_url`, but at that stage you have to decide how to name them and what to do with the downloaded images, which isn't specified in your question.
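As a minimal sketch of that step - assuming you're happy to name each file after the last component of its URL path and save it into the current directory, which is just one possible policy - you could do something like:

```
import os, urllib, urlparse

def download_image(full_url, dest_dir='.'):
    # Name the file after the last path component of the URL,
    # e.g. http://example.com/pics/cat.jpg -> cat.jpg
    filename = os.path.basename(urlparse.urlparse(full_url).path)
    if not filename:
        return  # skip URLs with no obvious file name
    urllib.urlretrieve(full_url, os.path.join(dest_dir, filename))
```

You'd then call `download_image(full_url)` inside the loop above, right after the `print`. Note that this scheme will silently overwrite files when two images share the same name, so you may want something smarter depending on what you're doing with them.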