No schema supplied and other errors with using requests.get()

I'm learning Python by following Automate the Boring Stuff. This program is supposed to go to http://xkcd.com/ and download all the images for offline viewing.

I'm on Python 2.7 on a Mac.

For some reason, I'm getting errors like "No schema supplied", along with errors from using requests.get() itself.

Here is my code:

# Saves the XKCD comic page for offline read

import requests, os, bs4, shutil

url = 'http://xkcd.com/'

if os.path.isdir('xkcd') == True: # If xkcd folder already exists
    shutil.rmtree('xkcd') # delete it
else: # otherwise
    os.makedirs('xkcd') # Creates xkcd folder.


while not url.endswith('#'): # When there are no more posts, the url will end with '#'; exit the while loop
    # Download the page
    print 'Downloading %s page...' % url
    res = requests.get(url) # Get the page
    res.raise_for_status() # Check for errors

    soup = bs4.BeautifulSoup(res.text) # Parse the page
    # Find the URL of the comic image
    comicElem = soup.select('#comic img') # Any #comic img it finds will be saved as a list in comicElem
    if comicElem == []: # if the list is empty
        print 'Couldn\'t find the image!'
    else:
        comicUrl = comicElem[0].get('src') # Get the first index in comicElem (the image) and save to
        # comicUrl

        # Download the image
        print 'Downloading the %s image...' % (comicUrl)
        res = requests.get(comicUrl) # Get the image. Getting something will always use requests.get()
        res.raise_for_status() # Check for errors

        # Save image to ./xkcd
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()
    # Get the Prev btn's URL
    prevLink = soup.select('a[rel="prev"]')[0]
    # The Previous button is first <a rel="prev" href="/1535/" accesskey="p">&lt; Prev</a>
    url = 'http://xkcd.com/' + prevLink.get('href')
    # adds /1535/ to http://xkcd.com/

print 'Done!'

Here are the errors:

Traceback (most recent call last):
  File "/Users/XKCD.py", line 30, in <module>
    res = requests.get(comicUrl) # Get the image. Getting something will always use requests.get()
  File "/Library/Python/2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 451, in request
    prep = self.prepare_request(req)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 382, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Python/2.7/site-packages/requests/models.py", line 304, in prepare
    self.prepare_url(url, params)
  File "/Library/Python/2.7/site-packages/requests/models.py", line 362, in prepare_url
    to_native_string(url, 'utf8')))
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/the_martian.png': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/the_martian.png?

The thing is, I've read the section in the book about this program multiple times, read the Requests docs, and looked at other questions on here. My syntax looks right.

Thanks for your help!

Edit:

This didn't work:

comicUrl = ("http:"+comicElem[0].get('src')) 

I thought adding the http: before would get rid of the no schema supplied error.

Corr answered 11/6, 2015 at 1:44 Comment(5)
gist.github.com/auscompgeek/5218149 (Earvin)
It's using urllib2, looking long and complicated as ever :D (Corr)
paste.ofcode.org/ZdXRAmTv3t9q9gYtv9eVDN (Earvin)
This works! But now I gotta go study the code to find out why... I'll just compare the two. Thank you! (Corr)
The thing is, I just reran the old code that didn't work, and now it works just fine... Now I'm REALLY confused. (Corr)

Change your comicUrl to this:

comicUrl = comicElem[0].get('src').strip("http://")
comicUrl="http://"+comicUrl
if 'xkcd' not in comicUrl:
    comicUrl=comicUrl[:7]+'xkcd.com/'+comicUrl[7:]

print "comic url",comicUrl
Earvin answered 11/6, 2015 at 2:9 Comment(6)
The error is the "no schema supplied" error. Specifically: requests.exceptions.MissingSchema: Invalid URL 'http:/1525/bg.png': No schema supplied. Perhaps you meant http:/1525/bg.png? (Corr)
Hey, your code got my program to run further than before, but it stopped at xkcd.com with the same res = requests.get(comicUrl) error. This would be around xkcd.com/1514 (it began at 1536). Do you have any other suggestions? I pasted your code before the print 'Downloading the image' line. (Corr)
The odd thing is, it runs differently every time I run the program. One run gets farther than the one before, and so on. The best I've gotten was 44 downloads in the folder. What is going on? (Corr)
It worked fine for me... If it's not working, check the URL and add print statements; if you don't care about one or more images not downloading, add try/except. (Earvin)
Did you try running it from your computer? How many files did it download to your folder? Mine went as far as 44 downloads. (Corr)
I know this is an old post. The above fix worked for me also. I was experimenting with the comicUrl = comicElem[0].get('src').strip('http://') that you have. Why does comicUrl = comicElem[0].get('src').strip('http:/') still also remove both // from comicUrl? (Hashimoto)

"No schema" means you haven't supplied the http:// or https:// part of the URL. Supply one of these and it will do the trick.

Edit: Look at this URL string!:

URL '//imgs.xkcd.com/comics/the_martian.png':
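
A robust way to turn whatever src contains (scheme-relative, site-relative, or absolute) into a full URL is the standard library's urljoin; a sketch, assuming Python 2.7 as in the question:

from urlparse import urljoin  # Python 2.7; on Python 3: from urllib.parse import urljoin

comicUrl = urljoin('http://xkcd.com/', '//imgs.xkcd.com/comics/the_martian.png')
# -> 'http://imgs.xkcd.com/comics/the_martian.png'
prevUrl = urljoin('http://xkcd.com/', '/1535/')
# -> 'http://xkcd.com/1535/'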

Coonskin answered 11/6, 2015 at 1:56 Comment(5)
But I'm not passing a URL to it; I'm looking through the HTML document and finding comicElem = soup.select('#comic img'). (Corr)
Yes, but in the HTML it will be using a relative URL, and requests needs an absolute one. Try this: comicUrl = "http://imgs.xkcd.com/comics/" + comicElem[0].get('src'), or some variant of it. (Coonskin)
I tried Ajay's suggestion, which is similar to yours, and I got the schema error. (Corr)
It is in there though; it's this: comicUrl = "http:" + comicElem[0].get('src') (Corr)
This is what fixed the same problem for me; sometimes we forget the URL starts with http or https. (Snatchy)

Explanation:

A few XKCD pages have special content that isn't a simple image file. That's fine; you can just skip those. If the selector doesn't find any elements, soup.select('#comic img') will return an empty list.

Working Code:

import requests,os,bs4,shutil

url='http://xkcd.com'

# making a fresh folder
if os.path.isdir('xkcd'):
    shutil.rmtree('xkcd')   # remove any previous download
os.makedirs('xkcd')         # then always recreate the folder


# scraping information
while not url.endswith('#'):
    print('Downloading Page %s.....' %(url))
    res = requests.get(url)          #getting page
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')   # specify a parser explicitly

    comicElem = soup.select('#comic img')     # get the img tag inside the #comic div
    if comicElem == []:                        #if not found print error
        print('could not find comic image')

    else:
        try:
            comicUrl = 'http:' + comicElem[0].get('src')             #getting comic url and then downloading its image
            print('Downloading image %s.....' %(comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()

        except requests.exceptions.MissingSchema:
            # skip if not a normal image file
            prev = soup.select('a[rel="prev"]')[0]
            url = 'http://xkcd.com' + prev.get('href')
            continue

        imageFile = open(os.path.join('xkcd',os.path.basename(comicUrl)),'wb')     #write  downloaded image to hard disk
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()

        #get previous link and update url
        prev = soup.select('a[rel="prev"]')[0]
        url = "http://xkcd.com" + prev.get('href')


print('Done...')
Astonish answered 26/12, 2016 at 10:56 Comment(0)

Actually this is not a big deal. You can see that the comicUrl looks something like this: //imgs.xkcd.com/comics/acceptable_risk.png

The only thing you need to add is http: (note: http:, not http://, as some folks said earlier, because the URL already contains the double slashes). So please change the code to:

res = requests.get('http:' + comicElem[0].get('src'))

or

comicUrl = 'http:' + comicElem[0].get('src')

res = requests.get(comicUrl)
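
If src might be either scheme-relative or already absolute, a small guard covers both cases; a sketch using the names from the question:

src = comicElem[0].get('src')
if src.startswith('//'):  # scheme-relative URL, as xkcd serves it
    comicUrl = 'http:' + src
else:
    comicUrl = src        # already has a scheme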

Happy coding

Capricorn answered 11/7, 2020 at 8:5 Comment(0)

I'd just like to chime in here: I had this exact same error and used @Ajay's recommended answer above, but even after adding that I was still getting problems. Right after the program downloaded the first image, it would stop and return this error:

ValueError: Unsupported or invalid CSS selector: "a[rel"

This was referring to one of the last lines in the program, where it uses the Prev button to go to the next image to download.

Anyway, after going through the bs4 docs, I made a slight change as follows, and it seems to work just fine now:

prevLink = soup.select('a[rel^="prev"]')[0]

Someone else might run into the same problem, so I thought I'd add this comment.

Underwear answered 21/11, 2015 at 2:45 Comment(0)

I have a similar issue. It somehow takes the Response (a 400) as the URL to request, so it's obvious that the URL is invalid. Here are my code and the error:

import cloudscraper  # to bypass Cloudflare, which is blocking plain requests-module requests
import time
import random
import json
import socket
from collections import OrderedDict
from requests import Session
 
 
with open("conf.json") as conf:
    config = json.load(conf)
    addon_api = config.get("Addon API")
    addonapi_url = config.get("Addon URL")
    addonapi_ip = config.get("Addon IP")
    addonapi_agent = config.get("Addon User-agent")
 
 
    # getip = socket.getaddrinfo("https://my.url.com", 443)
    # (family, type, proto, canonname, (address, port)) = getip[0]
    # family, type, proto, canonname, (address, port)) = getip[0]
 
    session = Session()
    headers = OrderedDict({
        'Accept-Encoding': 'gzip, deflate, br',
        'Host': addonapi_ip,
        'User-Agent': addonapi_agent
    })
    session.headers = headers
 
    # define the Data we will post to the Website
    data = {
        "apikey": addon_api,
        "action": "get_user_info",
        "value": "username"
    }
 
    try:  # try-block to handle exceptions if the request Failed
        randomsleep1 = random.randint(10, 30)
        randomsleep2 = random.randint(10, 30)
        randomsleep_total = randomsleep1 + randomsleep2
 
 
        data_variable = data
        headers_variable = headers
        payload = {"key1": addonapi_ip, "key2": data_variable, "key3": headers_variable}
 
        getrequest = session.get(url=addonapi_ip, data=data_variable, headers=headers_variable, params = payload)
        postrequest = session.get(url=addonapi_ip, data=data_variable, headers=headers_variable, params = payload)  # sending Data to the Website
        print(addonapi_ip)
 
        scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
        print(f"Sleeping for {randomsleep1} Seconds before posting Data to API!")
        time.sleep(randomsleep1)
        session.get(postrequest)  # sending Data to the Website
        print(f"Sleeping for {randomsleep2} Seconds before getting Data from API!")
        time.sleep(randomsleep2)
        print(f"Total Seconds i slept during the Request: {randomsleep_total}")
        session.post(postrequest)
        print(f"Data sent: {postrequest}")
        print(f"Data recived: {getrequest}")  # printing the output from the Request into our Terminal
 
 
    #    post = requests.post(addonapi_url, data=data, headers=headers)
    #    print(post.status_code)
    #    print(post.text)
 
    except Exception as e:
        raise e
        # print(e)  # print the error if one occurred
# =========================================== #
Sleeping for 15 Seconds before posting Data to API!
Traceback (most recent call last):
  File "C:\Users\You.Dont.See.My.Name\PythonProjects\addon_bot\addon.py", line 69, in <module>
    raise e
  File "C:\Users\You.Dont.See.My.Name\PythonProjects\addon_bot\addon.py", line 55, in <module>
    session.get(postrequest)  # sending Data to the Website
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 452, in prepare_request
    p.prepare(
  File "P:\Documents\IT\Python\lib\site-packages\requests\models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "P:\Documents\IT\Python\lib\site-packages\requests\models.py", line 387, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '<Response [400]>': No schema supplied. Perhaps you meant http://<Response [400]>?
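
The MissingSchema comes from session.get(postrequest): postrequest is a requests.Response object, not a URL, and its string form is '<Response [400]>', which is exactly what the error shows. A sketch of the fix, assuming addonapi_url from the config holds a full http(s) URL:

# pass the URL string itself, not a previous Response object
resp = session.get(addonapi_url, headers=headers, params=payload)
print(resp.status_code)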
Ilianailine answered 9/6, 2021 at 10:29 Comment(0)

I was getting this error as well, but within the context of a class I was creating. In my case, I forgot to put "self" as the first parameter of the function I was creating, so when it was expecting a "URL", it was really getting an object instead.
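
A minimal sketch of that mistake, with hypothetical names:

import requests

class Downloader:
    def fetch(url):  # oops: forgot `self`, so the instance gets bound to `url`
        return requests.get(url)

Downloader().fetch()  # requests receives a Downloader object, not a URL string,
                      # and raises requests.exceptions.MissingSchema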

Unmanned answered 14/6 at 6:40 Comment(0)
