HTTP Error 999: Request denied

I am trying to scrape some pages from LinkedIn using BeautifulSoup and I keep getting the error "HTTP Error 999: Request denied". Is there a way to avoid this error? As you can see in my code, I have tried both mechanize and urllib2, and both give me the same error.

from __future__ import unicode_literals

import codecs
import urllib
import urllib2
import urlparse

import mechanize
from bs4 import BeautifulSoup

# CSV file, decoded as utf-8 with undecodable bytes replaced
fout5 = codecs.open('data.csv', 'r', encoding='utf-8', errors='replace')

for y in range(2, 10):

    # base URL for the job search; the page_num query parameter is replaced below
    url = "https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=1"

    params = {'page_num': y}

    # rebuild the URL with the current page number in the query string
    url_parts = list(urlparse.urlparse(url))
    query = dict(urlparse.parse_qsl(url_parts[4]))
    query.update(params)
    url_parts[4] = urllib.urlencode(query)
    y = urlparse.urlunparse(url_parts)
    #print y

    #f = urllib2.urlopen(y)  # urllib2 attempt: fails with the same HTTP Error 999

    op = mechanize.Browser()     # use mechanize's browser
    op.set_handle_robots(False)  # tell the webpage you're not a robot
    j = op.open(y)               # this is where HTTP Error 999: Request denied is raised
    #print op.title()

    soup1 = BeautifulSoup(j.read())  # parse the fetched HTML, not the URL string
    print soup1
Frankel answered 17/5, 2015 at 15:34 Comment(5)
This is why you don't break a site's ToS by scraping their content. – Atthia
Mostly, deliberate bad robot behaviour of this kind needs to be discouraged: "#tell the webpage you're not a robot". – Oddity
@Oddity and how do you do that? – Alvira
@Calion: Are you a site owner or a robot operator? – Oddity
The latter, sort of. I have a Siri Shortcut that scrapes websites and will trigger 999 errors. – Alvira

Try setting a User-Agent header. Add this line after op.set_handle_robots(False):

op.addheaders = [('User-Agent', "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")]

Edit: if you want to scrape a web site, first check whether it has an API, or a library that wraps that API.
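
For reference, a minimal sketch of how that header fits into the mechanize flow from the question; the URL is just the example from the question, and there is no guarantee LinkedIn will accept a spoofed User-Agent:

import mechanize

op = mechanize.Browser()
op.set_handle_robots(False)  # ignore robots.txt
# note the ('key', 'value') tuple; addheaders is a list of pairs, not a dict
op.addheaders = [('User-Agent',
                  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36')]
response = op.open('https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=2')
print response.read()[:200]  # first 200 bytes of the page, if the request got through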

Reformism answered 17/5, 2015 at 15:52 Comment(1)
Scraping using a falsified UA, or via proxies, or ignoring robots.txt, or too quickly, are all paths to operating what is known as a "badly behaved robot". – Oddity
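
For context, a well-behaved crawler checks robots.txt before fetching. A minimal sketch with the Python 2 standard-library robotparser module; expect the answer to be restrictive for LinkedIn:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.linkedin.com/robots.txt')
rp.read()
# '*' is the generic user agent; the URL is the job-search page from the question
print rp.can_fetch('*', 'https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/')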

You should be using the LinkedIn REST API, either directly or using python-linkedin. It allows for direct access to the data, instead of attempting to scrape the JavaScript-heavy web site.
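
A rough sketch of what that looks like with python-linkedin; the key, secret and return URL are placeholders you get by registering an application with LinkedIn, and the method names below are taken from the library's documentation, so they may differ between versions:

from linkedin import linkedin

API_KEY = 'your-api-key'        # placeholder: issued when you register a LinkedIn app
API_SECRET = 'your-api-secret'  # placeholder
RETURN_URL = 'http://localhost:8000'

# OAuth flow: open authorization_url in a browser, authorise, then exchange the code
authentication = linkedin.LinkedInAuthentication(
    API_KEY, API_SECRET, RETURN_URL, linkedin.PERMISSIONS.enums.values())
print authentication.authorization_url

application = linkedin.LinkedInApplication(authentication)
print application.get_profile()  # structured data instead of scraped HTML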

Pathological answered 17/5, 2015 at 15:51 Comment(1)
The thing is, you need admin rights on the company page to get the (publicly available) company info. This is pretty stupid. – Compurgation
