HTTP Error 999: Request denied

I am trying to scrape some pages from LinkedIn using BeautifulSoup and I keep getting the error "HTTP Error 999: Request denied". Is there a way to avoid this error? As you can see in my code, I have tried both mechanize and urllib2, and both give me the same error.

from __future__ import unicode_literals

import codecs
import urllib
import urllib2
import urlparse

import mechanize
from bs4 import BeautifulSoup

# CSV file, decoded as utf-8 with undecodable bytes replaced
fout5 = codecs.open('data.csv', 'r', encoding='utf-8', errors='replace')

for y in range(2, 10):

    # base URL for the job search; the page_num query parameter is replaced below
    url = "https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=1"

    params = {'page_num': y}

    # rebuild the URL with the current page number in the query string
    url_parts = list(urlparse.urlparse(url))
    query = dict(urlparse.parse_qsl(url_parts[4]))
    query.update(params)
    url_parts[4] = urllib.urlencode(query)
    y = urlparse.urlunparse(url_parts)
    #print y

    #f = urllib2.urlopen(y)  # urllib2 attempt: fails with the same HTTP Error 999

    op = mechanize.Browser()     # use mechanize's browser
    op.set_handle_robots(False)  # tell the webpage you're not a robot
    j = op.open(y)               # this is where HTTP Error 999: Request denied is raised
    #print op.title()

    soup1 = BeautifulSoup(j.read())  # parse the fetched HTML, not the URL string
    print soup1
Frankel answered 17/5, 2015 at 15:34 Comment(5)
This is why you don't break a site's ToS by scraping their content. – Atthia
Mostly, deliberate bad robot behaviour of this kind needs to be discouraged: "#tell the webpage you're not a robot". – Oddity
@Oddity and how do you do that? – Alvira
@Calion: Are you a site owner or a robot operator? – Oddity
The latter, sort of. I have a Siri Shortcut that scrapes websites and will trigger 999 errors. – Alvira

Try setting a User-Agent header. Add this line after op.set_handle_robots(False):

op.addheaders = [('User-Agent', "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")]

Edit: if you want to scrape a web site, first check whether it has an API, or a library that wraps that API.
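
For reference, a minimal sketch of how that header fits into the mechanize flow from the question; the URL is just the example from the question, and there is no guarantee LinkedIn will accept a spoofed User-Agent:

import mechanize

op = mechanize.Browser()
op.set_handle_robots(False)  # ignore robots.txt
# note the ('key', 'value') tuple; addheaders is a list of pairs, not a dict
op.addheaders = [('User-Agent',
                  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36')]
response = op.open('https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=2')
print response.read()[:200]  # first 200 bytes of the page, if the request got through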

Reformism answered 17/5, 2015 at 15:52 Comment(1)
Scraping using a falsified UA, or via proxies, or ignoring robots.txt, or too quickly, are all paths to operating what is known as a "badly behaved robot". – Oddity
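
For context, a well-behaved crawler checks robots.txt before fetching. A minimal sketch with the Python 2 standard-library robotparser module; expect the answer to be restrictive for LinkedIn:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.linkedin.com/robots.txt')
rp.read()
# '*' is the generic user agent; the URL is the job-search page from the question
print rp.can_fetch('*', 'https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/')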

You should be using the LinkedIn REST API, either directly or using python-linkedin. It allows for direct access to the data, instead of attempting to scrape the JavaScript-heavy web site.
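
A rough sketch of what that looks like with python-linkedin; the key, secret and return URL are placeholders you get by registering an application with LinkedIn, and the method names below are taken from the library's documentation, so they may differ between versions:

from linkedin import linkedin

API_KEY = 'your-api-key'        # placeholder: issued when you register a LinkedIn app
API_SECRET = 'your-api-secret'  # placeholder
RETURN_URL = 'http://localhost:8000'

# OAuth flow: open authorization_url in a browser, authorise, then exchange the code
authentication = linkedin.LinkedInAuthentication(
    API_KEY, API_SECRET, RETURN_URL, linkedin.PERMISSIONS.enums.values())
print authentication.authorization_url

application = linkedin.LinkedInApplication(authentication)
print application.get_profile()  # structured data instead of scraped HTML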

Pathological answered 17/5, 2015 at 15:51 Comment(1)
The thing is, you need admin rights on the company page to get the (publicly available) company info. This is pretty stupid. – Compurgation
