Fetching Image from URL using BeautifulSoup

Asked 23/6, 2014 at 1:19 Answered 23/6, 2014 at 1:40

Solved python url web-scraping beautifulsoup urllib

I am trying to fetch the main images from a Wikipedia page, not the thumbnails or other GIFs, using the following code. However, "imgs" comes back with a length of 0. Any suggestions on how to fix this?

Code:

import urllib
import urllib2
from bs4 import BeautifulSoup
import os

html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")

soup = BeautifulSoup(html)

imgs = soup.findAll("div",{"class":"image"})

It would also be great if someone could explain in detail how to choose the findAll arguments by inspecting an element's source in the web page.

Barahona answered 23/6, 2014 at 1:19 Comment(0)

On that page, the image class is set on the a tags, not on div elements:

>>> img_links = soup.findAll("a", {"class":"image"})
>>> for img_link in img_links:
...     print img_link.img['src']
... 
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...

Or, even better, use the a.image > img CSS selector:

>>> for img in soup.select('a.image > img'):
...     print img['src']
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
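On the question of how to read the page source: the tag name you see in the inspector becomes the first argument, and the attribute becomes the filter. Here is a minimal sketch of that mapping, written for Python 3 with bs4 and the built-in "html.parser". The HTML fragment and the upload.example.org URLs are made up for illustration and only mimic the shape of the Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, shaped like what "Inspect element" shows for an image link.
SAMPLE_HTML = """
<div id="content">
  <a href="/wiki/File:Example.jpg" class="image">
    <img src="//upload.example.org/thumb/Example.jpg" width="100" height="80">
  </a>
  <a href="/wiki/Other"><img src="//upload.example.org/icon.gif" width="16" height="16"></a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# <a class="image"> in the source -> tag name "a", class filter "image".
by_find_all = [a.img["src"] for a in soup.find_all("a", class_="image")]

# The same relationship as a CSS selector: an img that is a direct child of a.image.
by_select = [img["src"] for img in soup.select("a.image > img")]

# The original query looked for <div class="image">, which this markup does not contain.
no_divs = soup.find_all("div", class_="image")

print(by_find_all)
print(by_select)
print(len(no_divs))
```

find_all is the newer spelling of findAll in bs4; both accept the same arguments, so the {"class": "image"} dictionary form used above works as well.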

UPD (downloading images using urllib.urlretrieve):

from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2

url = "http://en.wikipedia.org/wiki/Main_Page"
soup = BeautifulSoup(urllib2.urlopen(url))
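for img in soup.select('a.image > img'):
    # src is protocol-relative ("//upload..."), so resolve it against the page URL
    img_url = urlparse.urljoin(url, img['src'])
    # use the last path segment as the local file name
    file_name = img['src'].split('/')[-1]
    # save the image into the current working directory
    urlretrieve(img_url, file_name)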
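The snippet above is Python 2. On Python 3 the same helpers live in urllib.parse and urllib.request. The sketch below covers only the URL handling, which is the part that tends to trip people up with protocol-relative src values. The resolve_image helper is a hypothetical name introduced here, not something from bs4 or urllib.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = "http://en.wikipedia.org/wiki/Main_Page"


def resolve_image(page_url, src):
    """Return (absolute_url, file_name) for an img src value.

    Hypothetical helper for illustration only.
    """
    # urljoin fills in the scheme for "//host/path" values from the page URL.
    absolute = urljoin(page_url, src)
    # Use only the path component, so a query string never leaks into the file name.
    file_name = posixpath.basename(urlsplit(absolute).path)
    return absolute, file_name


src = "//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg"
absolute, file_name = resolve_image(PAGE_URL, src)
print(absolute)
print(file_name)
```

With these two values, the download step would be urllib.request.urlretrieve(absolute, file_name), the Python 3 counterpart of the urlretrieve call above.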

Militant answered 23/6, 2014 at 1:40 Comment(1)

@Barahona sure, check the UPD section. – Militant 23/6, 2014 at 12:41

I don't see any div tags with a class called 'image' on that page.

You could get all the image tags and throw away the ones that are small.

imgs = soup.select('img')
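One way to do the "throw away" step is to look at the width and height attributes declared on each tag. This is a rough sketch under a few assumptions: the 100-pixel threshold is arbitrary, is_large is a hypothetical helper name, and the declared attributes may not match the real file dimensions. Plain dicts stand in for tags here so the example runs without fetching anything; bs4 tags expose the same .get() method.

```python
def is_large(img, min_side=100):
    """Return True if both declared dimensions are at least min_side pixels.

    img is anything with a dict-style .get(), such as a bs4 Tag.
    """
    try:
        width = int(img.get("width", 0))
        height = int(img.get("height", 0))
    except (TypeError, ValueError):
        # Values such as "100%" are not pixel counts; treat them as unknown.
        return False
    return width >= min_side and height >= min_side


# Stand-ins for the tags that soup.select('img') would return.
candidates = [
    {"src": "photo.jpg", "width": "250", "height": "180"},
    {"src": "icon.gif", "width": "16", "height": "16"},
    {"src": "unsized.png"},
]

large = [img["src"] for img in candidates if is_large(img)]
print(large)
```

Against a real page this would become [img for img in imgs if is_large(img)]. Images with no declared size are dropped by this filter, which may or may not be what you want.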

Waistline answered 23/6, 2014 at 1:40 Comment(0)
