Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

I'm currently taking a web scraping class with other students, and we are supposed to make 'get' requests to a dummy site, parse it, and then visit another site.

The problem is that the dummy site's content is only up for a few minutes before it disappears, and it only comes back at a certain interval. While the content is available, everyone tries to make 'get' requests at once, so mine just hangs until the others clear, by which point the content has disappeared again. So I end up never being able to complete the 'get' request:

import requests
from splinter import Browser

browser = Browser('chrome')

# Hangs here
html = requests.get('http://dummysite.ca').text

# Even if the get is successful, this hangs here as well
# (parsed_url is a URL extracted from the parsed HTML)
browser.visit(parsed_url)

So my question is: what's the fastest/best way to keep making concurrent 'get' requests until I get a response?

Bias answered 5/5, 2017 at 16:26 Comment(5)
The 'speed' of your request is not the problem. The problem seems to be the responsiveness (or lack thereof) of the server. In my view, there is nothing you can do to reliably win the race; it would depend on other factors between you and the server. At best, you could try sending many concurrent requests, say one every second or so, to improve your chances. Honestly, though, I see no reason why such a server couldn't handle requests from all the students in a class concurrently, unless that is by design, the server lacks the required resources, or it is misconfigured.Boohoo
@Boohoo Appreciate your insight and response. If that is the case, do you mind showing how to go about sending many concurrent requests as suggested? Might as well give that a shot, so that I can upvote/accept an answer as well.Bias
@Boohoo Checking in to see if you've seen my previous response. Thank you in advance!Bias
Relevant SO Question: #43902593Aila
This is a little out of the box, but what about lobbying the other students to stop DDoS'ing your woefully under-resourced toy server, setting up an Amazon Lambda (or similar) routine to occasionally copy its data to someplace else that can handle the traffic, and then hitting that?Meghannmegiddo
  1. Decide to use either requests or splinter

    Read about Requests: HTTP for Humans
    Read about Splinter

  2. Related

    Read about keep-alive
    Read about blocking-or-non-blocking
    Read about timeouts
    Read about errors-and-exceptions (see the sketch after this list)
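
A minimal sketch that touches on these points, assuming plain requests (the URL and the timeout value are placeholders):

import requests

# A Session reuses the underlying TCP connection (keep-alive),
# so repeated requests skip the TCP handshake.
session = requests.Session()

try:
    # timeout bounds how long the call may block waiting for the server
    response = session.get('http://dummysite.ca', timeout=5)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx codes
    print(response.text[:200])
except requests.exceptions.RequestException as exc:
    # covers Timeout, ConnectionError, HTTPError, ...
    print('request failed:', exc)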

If you are able to get your requests not to hang, you can think about repeating them, for instance with a loop like the following sketch (a runnable version appears just after it):

while True:
    requests.get(...
    if request is successful:
        break

    time.sleep(1)
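
For completeness, a runnable version of that loop might look like this (the URL and the 5-second timeout are assumptions, not part of the question):

import time
import requests

while True:
    try:
        # A timeout keeps a single call from hanging indefinitely
        response = requests.get('http://dummysite.ca', timeout=5)
        if response.status_code == 200:
            break  # success, stop retrying
    except requests.exceptions.RequestException:
        pass  # timeout or connection error, retry

    time.sleep(1)

content = response.text
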
Aila answered 10/5, 2017 at 13:7 Comment(5)
How about #2633020? And what if I need to use both requests and splinter? Or can one do everything the other can?Bias
If you have read Point 1, you can decide which one fits your needs. Up to now you only want to retrieve the content of a URL.Aila
Could you clarify Point 1? Rather than just a working solution -- in fact I do have a working one as well -- I'm actually looking for the most optimal one, so I would appreciate it if you could clarify the ups and downs of each suggestion.Bias
Checking in to see if you've seen my previous comment. In the sample you provided, shouldn't requests.get(.. have a closing )?Bias
@JoKo: Look at the three dots ..., it's not completed - it's just to show that something has to follow.Aila

Gevent provides a framework for running asynchronous network requests.

It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.

Here is a short example of how to make 10 concurrent requests, based on the above code, and get their responses.

from gevent import monkey
monkey.patch_all()
import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]
# Wait for all requests to complete
pool.join()
for greenlet in greenlets:
    # This will raise any exceptions raised by the request
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`
    response = greenlet.get()
    text_response = response.text

You could also use map and a response-handling function instead of get.
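
For instance, a sketch along those lines with a small handler function (the function name, URL, and timeout are assumptions):

from gevent import monkey
monkey.patch_all()
import gevent.pool
import requests

def fetch(url):
    # Runs inside a greenlet; handle per-request errors here
    try:
        return requests.get(url, timeout=5).text
    except requests.RequestException:
        return None

pool = gevent.pool.Pool(size=10)
# map blocks until every greenlet finishes and returns results in order
texts = pool.map(fetch, ['http://dummysite.ca'] * 10)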

See gevent documentation for more information.

Sandglass answered 12/5, 2017 at 10:34 Comment(4)
How about #2633020?Bias
What about it? Gevent can easily handle tens of thousands of requests. Look up asynchronous network I/O and particularly the relation to number of available sockets, blocking code overhead and event loops to find out more.Sandglass
Would there be a way of doing it with splinter? Because ultimately I would like to use something like find_by_tag('option'), etc.Bias
@Joko - the same method works the same way with splinter, or any other library built on Python's standard library. Just change the requests import and make the target of pool.spawn the equivalent splinter function.Sandglass

In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, try the request again after a few seconds, and gradually increase the time between retries until you get the data that you want. For instance, your code might look like this:

import time
import requests

def get_content(url, timeout):
    # requests raises a Timeout exception if no response arrives within `timeout` seconds
    resp = requests.get(url, timeout=timeout)
    # raise generic exception if request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content


timeout = 5 # seconds
retry_interval = 0
max_retry_interval = 120
while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0        # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response

If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of its many middleware plugins.
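
As a rough illustration, a minimal Scrapy spider that leans on the built-in download timeout and retry settings might look like the sketch below (the spider name, selector, and setting values are assumptions, not something given in the question):

import scrapy

class DummySpider(scrapy.Spider):
    name = 'dummy'
    start_urls = ['http://dummysite.ca']

    # Scrapy's downloader handles timeouts and retries for us
    custom_settings = {
        'DOWNLOAD_TIMEOUT': 5,
        'RETRY_TIMES': 10,
        'RETRY_HTTP_CODES': [408, 500, 502, 503, 504],
    }

    def parse(self, response):
        # Follow a link found on the page, analogous to browser.visit(parsed_url)
        next_url = response.css('a::attr(href)').get()
        if next_url:
            yield response.follow(next_url, callback=self.parse_next)

    def parse_next(self, response):
        yield {'url': response.url, 'length': len(response.text)}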

Misfortune answered 15/5, 2017 at 15:44 Comment(2)
Would just like to keep trying until I get back the information that I need. In that case, wouldn't concurrency be more effective? Also, could you elaborate on what it means to send with a timeout interval? Does it mean if the request does not get back any response within the timeout interval defined, the request should just be dropped? Thank you in advanceBias
Usually concurrency is used to get content from multiple URLs not just a single URL. Or if you have other tasks that could be running while the request is being retrieved. I'm not sure you would get any benefit from concurrent design patterns for your use case. The timeout value is the number of seconds requests will wait for a response, after which, the request will be dropped.Misfortune

From the documentation for requests:

If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.

import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)

# Check the status code to see how the server is handling the request
print(r.status_code)

Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned, while 503 means the server is overloaded or undergoing maintenance.
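
A small sketch of acting on those codes (the Retry-After handling is an assumption about how an overloaded server might respond, not something from the question):

import time
import requests

r = requests.get('http://dummysite.ca', timeout=None)
if r.status_code == 200:
    html = r.text  # success, the content is available
elif r.status_code == 503:
    # An overloaded server may say how long to wait via the Retry-After header
    wait = r.headers.get('Retry-After')
    time.sleep(int(wait) if wait and wait.isdigit() else 30)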

Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests, which you can use to make concurrent requests endlessly until you get a 200 response:

import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a generator of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for i in out:
        # Failed requests come back as None, so guard before checking the code
        if i is not None and i.status_code == 200:
            print(i.text)
            del urls[out.index(i)]  # If we have the content, delete the URL
            return

while urls:
    keep_going()
Blowhard answered 15/5, 2017 at 23:10 Comment(4)
What should I change in the case of requesting the same URL many times? And could you elaborate on doing *10**8?Bias
The final code section will work fine with one url in the urls list. The *10**8 means that the url in the list which precedes it is multiplied 100 million times. The get requests are then sent 100 at a time. 10**8 is Python for 10 to the power of 8, which is 100,000,000. Sorry, that was just an example as to why the server might be running very slowly.Blowhard
Or you could send concurrent get requests to the same url by having only one url in urls and multiplying the list by an integer e.g. urls = ['www.example.com']*10 in the final block of code above.Blowhard
I have removed the part of the answer about *10**8.Blowhard
