How to test if a webpage is an image

3

6

Sorry that the title isn't very clear. Basically, I have a list containing a whole series of URLs, and I intend to download the ones that are pictures. Is there any way to check whether a URL points to an image, so that I can skip over the ones that don't?

Thanks in advance

Subserve answered 14/3, 2015 at 9:18 Comment(1)
similar question: #14645380 - Thole
M
6

You can use the requests module. Make a HEAD request and check the Content-Type header. A HEAD request does not download the response body.

import requests

# requests does not follow redirects for HEAD unless asked to
response = requests.head(url, allow_redirects=True)
print(response.headers.get('content-type'))
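The same check can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: the helper names `is_image_content_type` and `url_is_image` and the 10-second timeout are assumptions made for illustration. The server may omit Content-Type or report it incorrectly, so treat the result as a hint.

```python
import urllib.request


def is_image_content_type(content_type):
    """Return True if a Content-Type header value names an image type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def url_is_image(url, timeout=10):
    """Send a HEAD request (headers only) and inspect Content-Type.

    Hypothetical helper; network errors propagate to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_image_content_type(response.headers.get("Content-Type"))
```

To filter a list you could then write something like `images = [u for u in urls if url_is_image(u)]`, wrapping the call in a try/except if some URLs may be unreachable.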
Moselle answered 14/3, 2015 at 9:45 Comment(1)
you can get the Content-Type header using only the stdlib - Adamant
A
5

There is no reliable way. But you could find a solution that might be "good enough" in your case.

You could look at the file extension, if one is present in the URL; e.g., .png or .jpg could indicate an image:

>>> import os
>>> name = url2filename('http://example.com/a.png?q=1')
>>> os.path.splitext(name)[1]
'.png'
>>> import mimetypes
>>> mimetypes.guess_type(name)[0]
'image/png'

where url2filename() function is defined here.
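In case the linked definition is not at hand, here is a hypothetical stand-in for `url2filename()`. It is a minimal sketch and not necessarily the same as the definition referred to above: it assumes the file name is simply the last component of the URL path, with the query string and fragment ignored.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the last path component of a URL (hypothetical stand-in).

    The query string and fragment are ignored; percent-escapes are decoded.
    """
    path = urlsplit(url).path
    return posixpath.basename(unquote(path))
```

With this sketch, `url2filename('http://example.com/a.png?q=1')` gives `'a.png'`, and a URL whose path ends in a slash gives an empty string, so there is no extension to inspect.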

You could inspect the Content-Type HTTP header:

>>> import urllib.request
>>> r = urllib.request.urlopen(url) # make HTTP GET request, read headers
>>> r.headers.get_content_type()
'image/png'
>>> r.headers.get_content_maintype()
'image'
>>> r.headers.get_content_subtype()
'png'

You could check the very beginning of the HTTP body for magic numbers that indicate image files; e.g., a JPEG may start with b'\xff\xd8\xff\xe0', or:

>>> prefix = r.read(8)
>>> prefix # .png image
b'\x89PNG\r\n\x1a\n'

As @pafcu suggested in the answer to the related question, you could use the imghdr.what() function (note that imghdr was deprecated in Python 3.11 and removed in Python 3.13):

>>> import imghdr
>>> imghdr.what(None, b'\x89PNG\r\n\x1a\n')
'png'
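On Python versions that do not ship imghdr, the magic-number check can be sketched by hand. This is only an illustration: the `IMAGE_SIGNATURES` table and the `sniff_image_type` name are assumptions, and the table covers just PNG, JPEG, and GIF rather than every image format.

```python
# Leading bytes ("magic numbers") for a few common image formats.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(prefix):
    """Return a format name if the bytes start with a known signature, else None."""
    for signature, name in IMAGE_SIGNATURES.items():
        if prefix.startswith(signature):
            return name
    return None
```

You would pass it the first few bytes of the body, e.g. `sniff_image_type(r.read(8))`, which means only a small part of the response has to be read before deciding whether to keep it.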
Adamant answered 14/3, 2015 at 9:41 Comment(0)
T
1

You can use mimetypes https://docs.python.org/3.0/library/mimetypes.html

# Python 2 code: urllib.urlopen() and the print statement are not available in Python 3
import urllib
from mimetypes import guess_extension

url = "http://example.com/image.png"
source = urllib.urlopen(url)
extension = guess_extension(source.info()['Content-Type'])
print extension

This will print ".png" (guess_extension() returns the extension with its leading dot).
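On Python 3, the same idea might look like the sketch below. The function names `extension_for_content_type` and `guess_url_extension` are hypothetical, and the header is stripped of parameters such as `; charset=...` before the lookup, since `guess_extension()` expects a bare MIME type.

```python
import urllib.request
from mimetypes import guess_extension


def extension_for_content_type(content_type):
    """Map a Content-Type header value to a file extension, or None."""
    if not content_type:
        return None
    # guess_extension() wants a bare type such as "image/png".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return guess_extension(media_type)


def guess_url_extension(url, timeout=10):
    """Open the URL (GET), read only the headers, and guess the extension."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extension_for_content_type(response.headers.get("Content-Type"))
```

A result of `None` means the type is missing or unknown to mimetypes, which is a reasonable signal to skip that URL.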

Thole answered 14/3, 2015 at 9:34 Comment(2)
it won't work on Python 3 (the question has the python-3.x tag) - Adamant
You can make it work if you fix the imports. Also, it is not clear why you want to guess the file extension here. The Content-Type is clear by itself: it may even have the word 'image' in it (you could extract it as shown in my answer) - Adamant
