Python: urllib.error.HTTPError: HTTP Error 404: Not Found
Asked Answered

5

15

I wrote a script to find spelling mistakes in SO questions' titles. I used it for about a month, and it was working fine.

But now, when I try to run it, I get this error:

Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

This is my code:

import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker

my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})


print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
for qn in result:
    qn['link'] = "https://stackoverflow.com" + qn['link']
for qn in result:
    print(qn['link'], " Error Words:", qn['words'])
    url = qn['link']

UPDATE

This is the url causing the problem, even though the url exists:

https://stackoverflow.com/questions?page=298314&sort=active

I tried changing the range to some lower values. It works fine now.

Why did this happen with the above url?

Longanimity answered 24/2, 2017 at 14:32 Comment(3)
Can you print the requested url that gave you this error, please?Ahola
This one stackoverflow.com/questions?page=298314&sort=activeLonganimity
This is actually strange; I can reproduce the exact same problem for every page above around 270000. The pages exist, but I get an error when requesting them with PythonAhola
10

So apparently the default number of questions displayed per page is 50, so the range you defined in the loop goes beyond the number of pages available at 50 questions per page. The range should be adjusted to stay within the total number of pages.
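The page arithmetic can be sketched as follows (the question total below is a made-up figure, purely for illustration):

```python
import math

def last_page(total_questions, page_size):
    # Pages are 1-indexed; requesting any page beyond this returns 404.
    return math.ceil(total_questions / page_size)

# With a hypothetical 9,000,000 questions, page 298314 exists at
# 30 questions per page but not at 50 -- which is why one person could
# open the url in a browser while the other got a 404.
print(last_page(9_000_000, 30))  # 300000
print(last_page(9_000_000, 50))  # 180000
```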

This code catches the 404 error, which was the cause of your exception, and skips the page in case the loop goes out of range. It catches urllib.error.HTTPError specifically rather than using a bare except, so other failures still surface.

from urllib.error import HTTPError
from urllib.request import urlopen

def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError:
        # The page is beyond the last available page; skip it.
        pass

print("Please Wait.. it will take some time")
for i in range(298314, 298346):
    find_bad_qn(i)
Matson answered 24/2, 2017 at 14:41 Comment(8)
But that url exists.Longanimity
No, it returns a 404 error code, which means the url wasn't found. That is your error: urllib.error.HTTPError: HTTP Error 404: Not FoundMatson
Yes, but that url exists. You can try it. My range value is not a question id; it is the page number in active questions.Longanimity
I don't know what to tell you, the error you got is telling you that the url doesn't exist and if I click on the link you copied above I get a Page Not FoundMatson
Maybe your url is badly formed?Matson
I don't know why you are getting Page Not Found while I can load that page with no problem.Longanimity
Oh, I got the reason. You are viewing 50 questions per page; I am viewing 30 questions per page. That's why I have no problem and you get Page Not FoundLonganimity
Yeah, I just changed it and now it finds it. The default is probably 50 questions per page, then, and that's why your program returns 404.Matson
8

I have exactly the same problem. The url that I want to fetch with urllib exists and is accessible in a normal browser, but urllib is telling me 404.

The solution for me was not to use urllib:

import requests
response = requests.get(url)

This works for me.

Barhorst answered 29/12, 2018 at 15:2 Comment(0)
6

The default 'User-Agent' sent by urllib doesn't seem to get as much access as a browser User-Agent such as Mozilla's.

Try importing Request and passing headers={'User-Agent': 'Mozilla/5.0'} when building the request.

i.e.:

from urllib.request import Request, urlopen    
url = f"https://stackoverflow.com/questions?page={a}&sort=active"    
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})    
html = urlopen(req)
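As a quick sanity check (not from the original answer), you can inspect the Request object before making any network call to confirm the header is attached; note that urllib stores header names in capitalized form:

```python
from urllib.request import Request

url = "https://stackoverflow.com/questions?page=1&sort=active"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# urllib normalizes header keys, so look it up as 'User-agent'.
print(req.get_header('User-agent'))  # Mozilla/5.0
```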
Fluxion answered 13/4, 2020 at 4:10 Comment(0)
2

It is because the URL doesn't exist; please recheck your URL. I had the same issue, and on rechecking I found that my URL was wrong, so I corrected it.

Outpost answered 15/6, 2021 at 10:6 Comment(1)
There is already an accepted answer. I'm sure it wasn't a typo issue.Ligan
-1

Check by clicking on the link. The link may well be present in your code, meaning there is no problem with the code itself, but the page or site it points to no longer exists, which is why it is not found.

Tarsia answered 1/1, 2023 at 14:24 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Welloff

© 2022 - 2024 — McMap. All rights reserved.