How can I speed up fetching pages with urllib2 in python?
H

11

29

I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

I ran cProfile on it, and as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.

Thanks in advance! :)

UPDATE: I have a function called fetchURLs(), which I use to build an array of the URLs I need, so something like urls = fetchURLs(). The URLs are all XML files from the Amazon and eBay APIs (which makes me wonder why they take so long to load; maybe my web host is slow?)

What I need to do is load each URL, read each page, and send that data to another part of the script which will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched; that is my issue.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice :)


Here it is for time:

Sun Aug 15 20:51:22 2010    prof

         211352 function calls (209292 primitive calls) in 22.254 CPU seconds

   Ordered by: internal time
   List reduced from 404 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10   18.056    1.806   18.056    1.806 {_socket.getaddrinfo}
     4991    2.730    0.001    2.730    0.001 {method 'recv' of '_socket.socket' objects}
       10    0.490    0.049    0.490    0.049 {method 'connect' of '_socket.socket' objects}
     2415    0.079    0.000    0.079    0.000 {method 'translate' of 'unicode' objects}
       12    0.061    0.005    0.745    0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
     3428    0.060    0.000    0.202    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
     1698    0.055    0.000    0.068    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
     4125    0.053    0.000    0.056    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
     1698    0.042    0.000    0.358    0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
     1698    0.042    0.000    0.275    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)
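
In the profile above, the 10 calls to _socket.getaddrinfo account for 18.056 of the 22.254 seconds, which suggests that name resolution, rather than the download itself, may dominate. A minimal Python 3 sketch for timing lookups separately from fetching follows; the helper name time_lookup and its injectable resolver parameter are illustrative assumptions, not part of the original script.

```python
import socket
import time

def time_lookup(host, port=80, resolver=socket.getaddrinfo):
    """Return the seconds spent resolving host (hypothetical helper)."""
    start = time.perf_counter()
    # socket.getaddrinfo is the call that dominates the profile above.
    resolver(host, port)
    return time.perf_counter() - start
```

Calling time_lookup() once for each API host name would show whether the web host's DNS resolver is the slow part.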
Hydromedusa answered 16/8, 2010 at 2:3 Comment(0)
K
30

EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O. Therefore I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

This is a duplicate of a question from 3 days ago.

Python urllib2.urlopen() is slow, need a better way to read several urls

I'm polishing the code to show how to fetch multiple web pages in parallel using threads.

import time
import threading
import Queue
import urllib2

# utility - spawn a thread to execute target for each args
def run_parallel_in_threads(target, args_list):
    result = Queue.Queue()
    # wrapper to collect return value in a Queue
    def task_wrapper(*args):
        result.put(target(*args))
    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def dummy_task(n):
    for i in xrange(n):
        time.sleep(0.1)
    return n

# below is the application code
urls = [
    ('http://www.google.com/',),
    ('http://www.lycos.com/',),
    ('http://www.bing.com/',),
    ('http://www.altavista.com/',),
    ('http://achewood.com/',),
]

def fetch(url):
    return urllib2.urlopen(url).read()

run_parallel_in_threads(fetch, urls)

As you can see, the application-specific code has only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.

Unfortunately most other threading code posted here has some flaws. Many of them do active polling to wait for the code to finish. join() is a better way to synchronize the code. I think this code has improved upon all the threading examples so far.
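
For readers on Python 3, the join()-based synchronization can be sketched as follows. This is a port of the helper above, assuming the Queue module is imported under its Python 3 name queue; square is a stand-in task and not part of the original answer. The queue is drained only after every join() has returned, so no polling is needed and an empty queue never blocks.

```python
import queue
import threading

def run_parallel_in_threads(target, args_list):
    """Run target(*args) in one thread per args tuple and return the results."""
    results = queue.Queue()

    def task_wrapper(*args):
        results.put(target(*args))

    threads = [threading.Thread(target=task_wrapper, args=args)
               for args in args_list]
    for t in threads:
        t.start()
    # join() blocks until each thread finishes, so no polling loop is needed.
    for t in threads:
        t.join()
    # All producers are done, so empty() is reliable here.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected

def square(n):
    # Stand-in for a real task such as fetching a URL (illustrative only).
    return n * n

# Results arrive in completion order, so sort them for a stable display.
print(sorted(run_parallel_in_threads(square, [(1,), (2,), (3,)])))  # prints [1, 4, 9]
```

Note that each element of args_list is a tuple of arguments, which is why every URL in the example above is wrapped as ('http://...',).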

keep-alive connection

WoLpH's suggestion about using a keep-alive connection could be very useful if all your URLs point to the same server.
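
A minimal sketch of the keep-alive idea in Python 3, using the standard http.client module to send several requests over one connection. The function name fetch_paths and the connection_cls parameter are illustrative assumptions; whether the connection is actually kept open depends on the server.

```python
import http.client

def fetch_paths(host, paths, port=80, timeout=10,
                connection_cls=http.client.HTTPConnection):
    """Fetch several paths from one host through a single connection object."""
    conn = connection_cls(host, port, timeout=timeout)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read fully before the next request is sent.
            bodies.append(response.read())
    finally:
        conn.close()
    return bodies
```

This only helps when the paths share a host, and the requests still run one after another, so it complements rather than replaces the threaded approach.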

twisted

Aaron Gallagher is a fan of the twisted framework and is hostile to anyone who suggests threads. Unfortunately, a lot of his claims are misinformation. For example, he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to the evidence, as both Nick T and I have demonstrated a speed gain from using threads. In fact, I/O-bound applications have the most to gain from using Python's threads (vs. no gain in CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.

Right tool for the right job

I'm well aware of the issues pertaining to parallel programming using threads, Python, async I/O and so on. Each tool has its pros and cons. For each situation there is an appropriate tool. I'm not against twisted (though I have not deployed it myself). But I don't believe we can flat out say that threads are BAD and twisted is GOOD in all situations.

For example, if the OP's requirement were to fetch 10,000 websites in parallel, async I/O would be preferable. Threading would not be appropriate (unless maybe with stackless Python).
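
To illustrate the async alternative for large URL counts, here is a minimal Python 3 sketch using asyncio with a cap on concurrent requests. The names fetch_all, fetch_one and limit are hypothetical, and fake_fetch is a placeholder: a real non-blocking HTTP call is assumed to be supplied by the reader.

```python
import asyncio

async def fetch_all(urls, fetch_one, limit=100):
    """Run fetch_one(url) for every URL with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch_one(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))

async def fake_fetch(url):
    # Placeholder for a real non-blocking HTTP call (assumption).
    await asyncio.sleep(0.01)
    return len(url)

print(asyncio.run(fetch_all(["http://a/", "http://bb/"], fake_fetch)))  # prints [9, 10]
```

Everything runs in a single thread, so the cost per pending request is a coroutine rather than a thread stack.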

Aaron's objections to threads are mostly generalizations. He fails to recognize that this is a trivial parallelization task. Each task is independent and does not share resources. So most of his attacks do not apply.

Given that my code has no external dependencies, I'll call it the right tool for the right job.

Performance

I think most people would agree that the performance of this task depends largely on the networking code and the external server, and that the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

In Nick's code, there is an obvious flaw that causes the inefficiency. But how do you explain the 233ms speed gain over my code? I think even twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of twisted. There are, after all, a huge number of variables outside of the system code, such as the remote server's performance, the network, caching, and the differences in implementation between urllib2 and the twisted web client.

Just to make sure Python's threading does not incur a huge amount of inefficiency, I did a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying that the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.

In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[275]: <Queue.Queue instance at 0x038B2878>

In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16 s

In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
Wall time: 1.13 s       <<<<<<<< This means 0.13s of overhead

Further testing on my parallel fetching shows a huge variability in the response time in 17 runs. (Unfortunately I don't have twisted to verify Aaron's code).

0.75 s
0.38 s
0.59 s
0.38 s
0.62 s
1.50 s
0.49 s
0.36 s
0.95 s
0.43 s
0.61 s
0.81 s
0.46 s
1.21 s
2.87 s
1.04 s
1.72 s

My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test to measure the systematic performance difference between async I/O and threading.
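
The spread of the 17 timings listed above can be checked with a few lines of Python 3 (a small sketch; only the numbers come from the runs above):

```python
import statistics

# The 17 wall-clock times (in seconds) listed above.
runs = [0.75, 0.38, 0.59, 0.38, 0.62, 1.50, 0.49, 0.36, 0.95,
        0.43, 0.61, 0.81, 0.46, 1.21, 2.87, 1.04, 1.72]

spread = max(runs) - min(runs)
print("min %.2fs, max %.2fs, spread %.2fs" % (min(runs), max(runs), spread))
print("mean %.2fs, median %.2fs" % (statistics.mean(runs), statistics.median(runs)))
# prints:
# min 0.36s, max 2.87s, spread 2.51s
# mean 0.89s, median 0.62s
```

The 233ms difference under discussion is less than a tenth of this 2.51s spread.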

Konopka answered 16/8, 2010 at 6:5 Comment(24)
See the other comment I just left: I never said that threads can't be effective in this situation. It's just not worth the problems with threads that everyone seems to forget or ignore in their answers. Here is an enlightening graphic: erights.org/elib/concurrency/images/badtradeoff.gifCalysta
This isn't an answer, it's three comments. Please don't abuse the Q/A system, comment as necessary.Ludivinaludlew
@Aaron, you say thread is useless twice. Something cannot be useless if it is effective.Konopka
@Devin, I'm going to expand my response to contain both an answer and rebuttals. There is a significant amount of misinformation in this discussion. Unfortunately I need more space than a small comment allows for a rebuttal. This is an adaptive use of the Q/A system. Most importantly, I want people to choose the right tool for the right job, and not to reject a tool due to misinformation.Konopka
Wai Yip Tung, Amazing write up, you explained quite a bit, I appreciate that! I tried running your code, I just can't figure out what the data (the urlopen().read() ) is called.Hydromedusa
Scratch that, I got it now! I'm going to put it into my script and I'll let you know how it goes.Hydromedusa
Hmm, I seem to be getting <class 'Queue.Empty'>: args = () message = '' When I put your code in my script. Do I have to have the () and the , around each URL? If I do that, it kind of works, but one of the URLs becomes invalid.Hydromedusa
@Parker, I've polished the code even more. I posted it as a recipe on ASPN code.activestate.com/recipes/…. It should be really easy to use.Konopka
@Parker, I just read about your problem. Try the ASPN's version. I have adopted the map interface and dropped of the more clunky Queue and the (,) issue that has tripped you.Konopka
"Each task is independent and do not share resources." So, why use threads at all? Threads "share resources" by definition. Maybe this whole time you've been trying to suggest using a process pool for fetching pages, but were unaware of the differences between processes and threads.Calysta
@Wai, that's a false dichotomy. Threads are useless here because they add extra complexity that an event loop wouldn't add. Did you look at the graphic I posted in my first comment?Calysta
Jeez, this answer is just chock full of wrong. I keep finding more things. One of the biggest problems with threads is that they can appear to be simple, when there's a lot of things going on that aren't obvious at all. You claim that because your code is three lines long, it can't be complex or unmaintainable. I can't vouch for whether urllib2 is thread-safe, but there's a number of things that aren't thread safe, and will break in subtle ways when run with run_parallel_in_threads. The complexity is still there, but deferred to other places.Calysta
And I'm really confused as to how you can say that threading is not consistently slower than async IO when you didn't even test async IO. I'm the only one who's posted benchmarks using twisted, and my benchmarks do show a consistent difference.Calysta
I don't need to test async I/O. I say that because my own results vary by as much as 2.51s, so it is not valid to claim that an alternative solution is consistently faster by a much smaller margin. Only if the alternative code were slower than this code by a margin much greater than 2.51s could we claim it is consistently slower.Konopka
@Aaron, you have not found any problem. You are just making FUD claims. If you find anything in urllib2, or any other part of Python, that's not thread safe, please file a bug. There is tons of production software using threads. If Python were not designed to be thread safe, they would all be idiots to use it in production.Konopka
Threads "share resources" by definition? In what sense? Can you point out what resources are shared in this code? And please don't make the stupid suggestion that I confuse processes with threads. I cannot possibly be such an idiot.Konopka
Wai Yip Tung, I cannot express enough how much of a life saver you have been. You constantly came back to address any issues I had and you went out of your way to explain everything and why I should use what and how to use it. I really appreciate what you've done for me! Thanks so much!Hydromedusa
Off-topic; how does this answer have 5 downvotes? Twisted fans are quite vindictive. :PHanlon
I don't know, but it kind of scares me off from using the module :P I don't see what the "issue with threads" is. For my purposes, I'm just loading a few URLs, I don't need to import some massive library when I can just fetch them in parallel. Also, the threading lets me wait for all of the pages to be fetched before moving on, whereas Twisted will try to keep going and cause issues.Hydromedusa
@Nick, it is not just me. Almost everyone gets a down vote, presumably on the grounds that threads are bad. Usually people on stackoverflow are quite civilized even when they disagree. This is the only group of people who go to great lengths to vote people down.Konopka
@WaiYipTung, as you scale up the concurrent connections you start to hit serious problems with a thread-based approach. Firstly you hit resource problems due to the comparatively high memory consumption of each thread, and secondly you'll hit locking problems as all those threads concurrently try to write their output to shared data structures or resources. Single-threaded apps don't have to deal with locking, and event-based frameworks mean you don't need threads to parallelise I/O.Aenneea
I just used this to improve a script I wrote from 38 seconds to 3 seconds! That's the difference between using this script a lot and using it only when I absolutely need to. Best of all it took 10 minutes to add and didn't require any additional packages. Thanks!Wheatworm
@WaiYipTung Quick question, if I am using selenium webdriver ,setting cookies and then fetching these pages, by spawning multiple threads, will this still work ? The point is that I do have javascript that I need to wait for to be executed, once I have ensured that I wait for the presence of the corresponding classes, Can I still wrap the rest around your example above ? My use case is scraping some elements, post the java script is executed on these pages. I could post this as an additional question, if need be.Crenelate
@WaiYipTung Or, a larger question would be is Selenium Webdriver thread safe ? the answer to which "A: WebDriver is not thread-safe. Having said that, if you can serialise access to the underlying driver instance, you can share a reference in more than one thread. This is not advisable. You /can/ on the other hand instantiate one WebDriver instance for each thread." But if my urls can go fetch in parallel, does that matter, will the threadsCrenelate
C
19

Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.

from twisted.internet import defer, reactor
from twisted.web.client import getPage
import time

def processPage(page, url):
    # do something here.
    return url, len(page)

def printResults(result):
    for success, value in result:
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

def printDelta(_, start):
    delta = time.time() - start
    print 'ran in %0.3fs' % (delta,)
    return delta

urls = [
    'http://www.google.com/',
    'http://www.lycos.com/',
    'http://www.bing.com/',
    'http://www.altavista.com/',
    'http://achewood.com/',
]

def fetchURLs():
    callbacks = []
    for url in urls:
        d = getPage(url)
        d.addCallback(processPage, url)
        callbacks.append(d)

    callbacks = defer.DeferredList(callbacks)
    callbacks.addCallback(printResults)
    return callbacks

@defer.inlineCallbacks
def main():
    times = []
    for x in xrange(5):
        d = fetchURLs()
        d.addCallback(printDelta, time.time())
        times.append((yield d))
    print 'avg time: %0.3fs' % (sum(times) / len(times),)

reactor.callWhenRunning(main)
reactor.run()

This code also performs better than any of the other solutions posted (edited after I closed some things that were using a lot of bandwidth):

Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 29996)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.518s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.461s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30033)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.435s
Success: ('http://www.google.com/', 8117)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.449s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.547s
avg time: 0.482s

And using Nick T's code, rigged up to also give the average of five and show the output better:

Starting threaded reads:
...took 1.921520 seconds ([8117, 30070, 15043, 8386, 28611])
Starting threaded reads:
...took 1.779461 seconds ([8135, 15043, 8386, 30349, 28611])
Starting threaded reads:
...took 1.756968 seconds ([8135, 8386, 15043, 30349, 28611])
Starting threaded reads:
...took 1.762956 seconds ([8386, 8135, 15043, 29996, 28611])
Starting threaded reads:
...took 1.654377 seconds ([8117, 30349, 15043, 8386, 28611])
avg time: 1.775s

Starting sequential reads:
...took 1.389803 seconds ([8135, 30147, 28611, 8386, 15043])
Starting sequential reads:
...took 1.457451 seconds ([8135, 30051, 28611, 8386, 15043])
Starting sequential reads:
...took 1.432214 seconds ([8135, 29996, 28611, 8386, 15043])
Starting sequential reads:
...took 1.447866 seconds ([8117, 30028, 28611, 8386, 15043])
Starting sequential reads:
...took 1.468946 seconds ([8153, 30051, 28611, 8386, 15043])
avg time: 1.439s

And using Wai Yip Tung's code:

Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30051 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.704s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.845s
Fetched 8153 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30070 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.689s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.647s
Fetched 8135 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30349 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.693s
avg time: 0.715s

I've gotta say, I do like that the sequential fetches performed better for me.

Calysta answered 16/8, 2010 at 3:20 Comment(17)
I do like that I've gotten -2 with no comments! Come on, downvoters, try to show that my code is bad~Calysta
No downvote from me since it's a proper solution. But do you have a plain Python version instead of using the huge Twisted framework?Mesquite
Your benchmarks are mildly flawed, imho. You are benchmarking the great search engines, which will always respond nearly instantly. When using your solution with normal websites, the sequential fetches will perform worse because then the bottleneck will be on the server side/internet instead of your Python code.Mesquite
@WoLpH, I modified the other code I tested to request the same sites. See how the lengths are all basically the same?Calysta
@WoLpH, also, "huge"? Twisted is quite a bit smaller than python.Calysta
I am not disputing that. I am saying that most websites will not respond as fast as the major search engines. When testing with any regular website with lots of content your results will be completely different. Sequentially fetching results will only be faster in cases like these where your Python code is actually the bottleneck.Mesquite
According to SLOCCount the twisted source has 144,898 physical source lines of code. Such a codebase is huge in my book. If you want the person who asked the question to actually understand the code he's using, it will be hard to read through all the code used in Twisted.Mesquite
@WoLpH, that's probably including all of the unit tests, and all of the optional packages that wouldn't be necessary for something as simple as fetching web pages. And, again, I guarantee that the number of lines used in python itself just to invoke urllib2 is going to be greater. And regarding the benchmarks, I could pick another bunch of sites, but the original code used only docs.python.org, which is dog slow and a bit unreliable on my connection.Calysta
@Aaron Gallagher: Yes, there is bunch of other code in Twisted that is not used here. But it doesn't negate the fact that the amount of code you'll have to read through with Twisted will be substantial. As for reading urllib2, that's not the point here. The working of urllib2 is not the question, it's the working of either a threading approach or the async approach. well... that's the point. On slow websites the results are completely different. And most websites are closer to docs.python.org than google.com in terms of performance.Mesquite
I would use Twisted, but I want to print the data of the fetched pages in a certain order. Can I do this with twisted? It seems I might reach the part of the script that prints the info before it actually arrives. Can I make my script pause until the data is received?Hydromedusa
@Parker, that's exactly what the DeferredList does. Here's a link to another answer I wrote that describes how it works a bit better: #3489354Calysta
Hmm, I tried your example code, but the script just hung and wouldn't do anything.Hydromedusa
@Parker: The reactor endlessly runs, there are probably some ways to have it automatically die when you're done, but if you're doing a quick script it's overkill. If your application bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 just needs to get 4 pages(?) then firing off a thread for each is fine. Aaron seems to be hell-bent on Twisted (and blanket-declaring threads as "useless"); while it's a fine tool, it is not needed to replace every instance of urllib.Hanlon
Speaking of which, you could toss some example pages you're getting into the question to provide more applicable data. (versus instantly getting homepages from massive websites, which would be where threads don't perform as well)Hanlon
Nick, thanks! I updated the question for you. I've been trying the threads but the script just locks up and doesn't do anything.Hydromedusa
@Parker, If you have a large list of urls this approach may not work well for you as it opens one connection per url more or less simultaneously. This may be causing your internet connection to choke up. Try running a smaller number of urls at a time to see if that helpsMinaret
I'm not sure why 5 different websites were used for the tests... OP clearly states "The URLS are all XML files from Amazon and eBay APIs", and thinking that using 5 sequential connections against 2 hosts will be faster than 5 simultaneous connections goes against the conventional wisdom of nearly every web browser and download manager out there...Mukluk
M
5

Here is an example using python Threads. The other threaded examples here launch a thread per url, which is not very friendly behaviour if it causes too many hits for the server to handle (for example it is common for spiders to have many urls on the same host)

from threading import Thread
from urllib2 import urlopen
from time import time, sleep

WORKERS=1
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = []

class Worker(Thread):
    def run(self):
        while urls:
            url = urls.pop()
            results.append((url, urlopen(url).read()))

start = time()
threads = [Worker() for i in range(WORKERS)]
for t in threads:
    t.start()

while len(results)<40:
    sleep(0.1)
print time()-start

Note: The times given here are for 40 urls and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300ms

With WORKERS=1 it took 86 seconds to run
With WORKERS=4 it took 23 seconds to run
With WORKERS=10 it took 10 seconds to run

so having 10 threads downloading is 8.6 times as fast as a single thread.

Here is an upgraded version that uses a Queue. There are at least a couple of advantages.
1. The urls are requested in the order that they appear in the list
2. Can use q.join() to detect when the requests have all completed
3. The results are kept in the same order as the url list

from threading import Thread
from urllib2 import urlopen
from time import time, sleep
from Queue import Queue

WORKERS=10
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = [None]*len(urls)

def worker():
    while True:
        i, url = q.get()
        # print "requesting ", i, url       # if you want to see what's going on
        results[i]=urlopen(url).read()
        q.task_done()

start = time()
q = Queue()
for i in range(WORKERS):
    t=Thread(target=worker)
    t.daemon = True
    t.start()

for i,url in enumerate(urls):
    q.put((i,url))
q.join()
print time()-start
Minaret answered 16/8, 2010 at 7:57 Comment(15)
Not according to my benchmarks. I think you're doing something wrong.Calysta
@Aaron, The program is right there. It's pretty simple. Why do you think I am doing something wrong?Minaret
Well, I'm just going to stop right after "you're not using any thread-safe data structures to communicate through threads" because that's a painfully amateurish mistake.Calysta
@Aaron, Do you mean list.pop() and list.append()? They are guaranteed to be thread safe in Python.Minaret
Guaranteed by whom? Do you have a link to a document that espouses this? (And if it were true, why does the Queue module exist?)Calysta
effbot.org/zone/thread-synchronization.htm#atomic-operations Using a Queue here instead would not be difficult. A list is adequate for this simple example as I don't mind fetching the url's in reverse order. Obviously popping from the beginning of a list over and over is not very efficientMinaret
@gnibbler, this document is completely incorrect. I can't say that it's just out of date because, well, I don't know when this was ever true. None of these operations are atomic; I can point out times when each one of them could yield execution to another thread. (A simple example: reading an instance attribute has at least a dozen different ways to invoke arbitrary python. __getattr__ is one everyone knows of.) This would be too long for just one comment, so if you were to make a new question for this, I'd be glad to list the problems.Calysta
I would like to listen to you to show how list.append is not atomic. I've looked at the byte code - dis.dis(compile("[].append(1)","","exec")). The append happens in instruction #9. It looks atomic to me.Konopka
@Aaron, Queue does more than transferring data atomically. It is a bounded buffer, meaning it can block producer or consumer until data or space is available for synchronization purpose.Konopka
gnibbler, thanks! :) However, when I run the script (the second one), which I copied word for word, just replacing the URLs with urls = getURLS(), it just keeps running. It won't display anything or stop.Hydromedusa
@Parker, have you tried adding the print statement where I indicated? How many urls does getURLS return? Perhaps it is just taking a long time.Minaret
@Wai, just because it's implemented in one opcode in one part of the bytecode doesn't mean that it's an atomic operation. Calling a python function, for example, only takes one opcode. Would you say that calling an arbitrary python function is atomic?Calysta
@Wai, and, uh, I never disagreed on what Queue is for? I don't understand what you're trying to say with that comment.Calysta
@Aaron, about Queue, let me remind you of the context. You were challenging gnibbler's claim that list.append is atomic, and you said Queue would not exist if list.append were atomic. I was reminding you that Queue's primary purpose is to implement a bounded buffer.Konopka
@Aaron, about if the append function call is atomic. I think anyone with good sense will design a fundamental operation like append as single step, not compose of other python steps. But if you don't believe it, fair enough, let's look at the Python source code. (svn.python.org/view/python/tags/r27rc2/Objects/…) Append is implemented by PyList_Append(). It looks pretty sane for me. No release of GIL. No calling of other Python function. I stopped tracing when it gets to PyMem_RESIZE. But I think they will be insane to release the GIL there.Konopka
T
3

Since this question was posted it looks like there's a higher level abstraction available, ThreadPoolExecutor:

https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example

The example from there pasted here for convenience:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

There's also map which I think makes the code easier: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
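
A minimal sketch of the map variant, which returns results in the same order as the input list. The names load_all and fake_load are illustrative; fake_load stands in for a real download so the example runs without network access.

```python
import concurrent.futures

def load_all(urls, load_url, max_workers=5):
    """Apply load_url to every URL in a thread pool; results keep input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the order of `urls`, not completion order.
        return list(executor.map(load_url, urls))

def fake_load(url):
    # Stand-in for a real download such as urllib.request.urlopen(url).read().
    return url.upper()

print(load_all(["http://a/", "http://b/"], fake_load))  # prints ['HTTP://A/', 'HTTP://B/']
```

With map, an exception raised by any call surfaces when its result is retrieved, so per-URL error handling as in the as_completed example needs a try/except inside the loader function.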

Tinge answered 11/12, 2015 at 17:32 Comment(2)
you could also use multiprocessing.pool.ThreadPool that is available even on Python 2. Here's code example.Tangelatangelo
you have to install futures, to make it available in Python 2Tangelatangelo
M
2

The actual wait is probably not in urllib2 but in the server and/or your network connection to the server.

There are 2 ways of speeding this up.

  1. Keep the connection alive (see this question on how to do that: Python urllib2 with keep alive)
  2. Use multiple connections; you can use threads or an async approach as Aaron Gallagher suggested. For that, simply use any threading example and you should do fine :) You can also use the multiprocessing lib to make things pretty easy.
Mesquite answered 16/8, 2010 at 2:8 Comment(19)
Thanks WoLpH! Much appreciated :) Will keeping the connection alive work even if I'm grabbing different web pages?Hydromedusa
@Parker: the up arrow to the left of the arrow says "This answer is useful" and at 15+ reputation, you can click it in addition to the accept checkmark.Improvisation
-1 for suggesting threads. This is IO-bound; threads are useless here.Calysta
@Aaron, usually threads work brilliantly for downloading webpages. The process won't be I/O bound unless it's downloading really large files or the latency is very low. urllib2 will typically spend most of its time blocked, waiting for a response, which is perfect conditions for Python's GIL/threadingMinaret
@gnibbler, no, that's what IO bound means: the process spends most of its time waiting on IO. Multiple threads don't make you wait for data any faster. Just use nonblocking IO; there's no extra code complexity or locking overhead.Calysta
@Aaron, sure you are correct about the definition of IO Bound, but wrong about the effectiveness of threading to download a bunch of urls.Minaret
@gnibbler, I never once said it wasn't effective. I've only been claiming that it's not worth the numerous pitfalls and caveats associated with it, which most of the answers conveniently gloss over or ignore.Calysta
@Aaron Gallagher: threads are far from useless here. Yes it is IO bound, but not your IO. It's the latency, bandwidth and server response time that will limit you. Threading is by far the most effective way of making a download system faster.Mesquite
@Parker: keeping the connection alive will only work as long as you stay on the same server. If the new page is on a different webserver, then it will fail. So a good guess is... if it is still on the same domain and subdomain, then you can keep the connection alive. Otherwise you probably don't want to try.Mesquite
@WoLpH, did you miss the part where I posted benchmarks of twisted beating out two different implementations of threaded downloaders? Twisted runs in a single thread, using an event loop. There is no locking overhead. Asynchronous IO, through an event loop, is the most effective way of making a download faster.Calysta
@Aaron Gallagher: yes, I didn't see your answer yet. Letting it work async is a nice way of fixing the problem indeed. On the lower level it doesn't make much difference however. Using async is a nice way of emulating multiple threads indeed. And for this purpose a better solution. However... I would have opted for a plain Python solution instead of using the huge Twisted framework.Mesquite
@WoLpH, uh, event loops are not in any way trying to "emulate" multiple threads. If anything, threads try to "emulate" being an event loop by trying to turn a blocking API into something that can be used asynchronous.Calysta
@Aaron Gallagher: Alright, rephrased to emulate the behaviour of multiple threads running simultaneously. If you want to go down a couple of levels, then you will simply end up with CPython executing everything sequentially on a single processor and switching as soon as one of the threads blocks. The system is quite comparable really: with threads your method gets CPU time once it stops blocking. With async your method gets called once it stops blocking. It is just a different interface for the same technology.Mesquite
@WoLpH: Twisted is a plain python solution -- it is written in python.Belletrist
@nosklo: Do you seriously not know what I meant, or are you just trying to start a pointless debate? Either way, when I was talking about a plain Python solution I meant using the Python base library versus the use of a huge library.Mesquite
@WoLpH: Twisted is written in python and opensource. You could just copy what twisted does to inside your code -- then you'll have a plain python solution, using only the base library. The part that does this isn't that big.Belletrist
@nosklo: I'll have to take your word for it not being that big. My experience with Twisted has been quite the opposite.Mesquite
@WoLpH: That's irrelevant anyway. Point is that the best, fastest, correct™ way of doing multiple downloads in parallel, using only python, is to do what twisted does: use asynchronous code. You could write it yourself, but Twisted is already written and well tested, so why not use it? My hard drive is 500GB so it fits twisted many times, the size does not matterBelletrist
@nosklo: The size does matter. No, not in terms of storage of course, but in terms of readability. A huge codebase will take more time to learn/understand than a small codebase. So an asyncore example would be more fitting here.Mesquite
P
2

Most of the answers focus on fetching multiple pages from different servers at the same time (threading), but not on reusing an already open HTTP connection, which matters if the OP is making multiple requests to the same server/site.

In urllib2 a separate connection is created for each request, which hurts performance and results in a slower rate of fetching pages. urllib3 solves this problem by using a connection pool. You can read more here: urllib3 (also thread-safe).

There is also Requests, an HTTP library that uses urllib3.

This, combined with threading, should increase the speed of fetching pages.
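As a rough illustration of combining the two, the sketch below shares one urllib3 PoolManager between worker threads. It assumes Python 3 with the third-party urllib3 package installed; fetch_with_pool, the http parameter and the 30 second timeout are illustrative choices, not anything prescribed by urllib3.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_pool(urls, http=None, max_workers=5):
    """Fetch urls concurrently through one shared connection pool."""
    if http is None:
        import urllib3  # third-party: pip install urllib3
        # maxsize lets each per-host pool keep one connection per worker.
        http = urllib3.PoolManager(maxsize=max_workers)

    def fetch(url):
        # Requests to a host already seen reuse an open connection
        # from the pool instead of opening a new one.
        return http.request('GET', url, timeout=30.0).data

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))
```

The gain from pooling only shows up when several of the URLs point at the same host; for URLs spread over many hosts the threads do most of the work.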

Premaxilla answered 20/12, 2013 at 13:19 Comment(0)
F
1

Nowadays there is an excellent Python lib called requests that does this for you.

Use the standard API of requests if you want a solution based on threads, or the async API (which uses gevent under the hood) if you want a solution based on non-blocking IO.
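For comparison, the non-blocking approach can also be sketched with only the standard library, using asyncio rather than the gevent-based API mentioned above. This assumes Python 3.7 or later. The http_get helper is deliberately simplistic and hypothetical: it speaks bare HTTP/1.0 over plain http:// only, with no TLS, redirects or status handling, so a real HTTP client should replace it in practice.

```python
import asyncio
from urllib.parse import urlsplit


async def http_get(url):
    """Fetch a plain http:// URL with a bare HTTP/1.0 GET; return the body."""
    parts = urlsplit(url)
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    reader, writer = await asyncio.open_connection(parts.hostname, parts.port or 80)
    try:
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(target, parts.hostname)
        writer.write(request.encode('ascii'))
        await writer.drain()
        raw = await reader.read()  # HTTP/1.0: the server closes when it is done
    finally:
        writer.close()
        await writer.wait_closed()
    _headers, _, body = raw.partition(b'\r\n\r\n')
    return body


async def gather_pages(urls, fetch=http_get, limit=10):
    """Run fetch for every URL in one thread, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns the results in the same order as urls.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

It would be driven with pages = asyncio.run(gather_pages(urls)); everything runs on a single thread, and the semaphore caps how many requests are in flight at once.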

Farmyard answered 27/1, 2012 at 13:31 Comment(0)
S
1

Here's a standard library solution. It's not quite as fast, but it uses less memory than the threaded solutions.

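try:
    from http.client import HTTPConnection, HTTPSConnection
except ImportError:
    from httplib import HTTPConnection, HTTPSConnection
connections = []
results = []

# Send every request first, without waiting for any response...
for url in urls:
    parts = url.split('/', 3)  # ['http:', '', host, path]
    scheme, host = parts[0], parts[2]
    path = parts[3] if len(parts) > 3 else ''  # the URL may have no path
    h = (HTTPConnection if scheme == 'http:' else HTTPSConnection)(host)
    h.request('GET', '/' + path)
    connections.append(h)
# ...then collect the responses, so the servers work while we wait.
for h in connections:
    results.append(h.getresponse().read())
    h.close()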

Also, if most of your requests are to the same host, then reusing the same http connection would probably help more than doing things in parallel.
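A minimal sketch of that reuse, assuming Python 3 and that every path lives on the same host. The function name fetch_paths and the connection_factory parameter are illustrative; the factory exists only so that HTTPSConnection (or a stub) can be swapped in.

```python
from http.client import HTTPConnection


def fetch_paths(host, paths, connection_factory=HTTPConnection, timeout=30):
    """GET several paths from one host over a single connection."""
    conn = connection_factory(host, timeout=timeout)
    bodies = []
    try:
        for path in paths:
            conn.request('GET', path)
            response = conn.getresponse()
            # The body has to be read completely before the same
            # connection can be used for the next request.
            bodies.append(response.read())
    finally:
        conn.close()
    return bodies
```

For example, fetch_paths('example.com', ['/a', '/b']) would pay for name resolution and connection setup once instead of once per page.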

Shults answered 23/11, 2014 at 16:2 Comment(0)
A
1

Here is a Python network benchmark script for identifying why a single connection is slow:

"""Python network test."""
from socket import create_connection
from time import time

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen

TIC = time()
create_connection(('216.58.194.174', 80))
print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
create_connection(('google.com', 80))
print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
urlopen('http://216.58.194.174')
print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
urlopen('http://google.com')
print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))

An example of the results with Python 3.6:

Duration socket IP connection (s): 0.02
Duration socket DNS connection (s): 75.51
Duration urlopen IP connection (s): 75.88
Duration urlopen DNS connection (s): 151.42

Python 2.7.13 has very similar results.

In this case, DNS and urlopen slowness are easily identified.
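The profile in the question shows most of the time going to _socket.getaddrinfo, so it may be worth timing name resolution on its own. The helper below is a small sketch for Python 3; time_lookup and its resolver parameter are illustrative names, and the resolver defaults to the real socket.getaddrinfo.

```python
import socket
from time import perf_counter


def time_lookup(host, port=80, resolver=socket.getaddrinfo):
    """Return (seconds taken, number of addresses) for one name lookup."""
    start = perf_counter()
    addresses = resolver(host, port)
    return perf_counter() - start, len(addresses)
```

Calling time_lookup('google.com') a couple of times in a row gives a feel for whether lookups are consistently slow; if they are, the resolver configuration of the machine is a more likely culprit than urllib itself.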

Anselma answered 1/2, 2017 at 18:28 Comment(0)
W
1

Ray offers an elegant way to do this (in both Python 2 and Python 3). Ray is a library for writing parallel and distributed Python.

Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).

import ray
import sys

ray.init()

@ray.remote
def fetch(url):
    if sys.version_info >= (3, 0):
        import urllib.request
        return urllib.request.urlopen(url).read()
    else:
        import urllib2
        return urllib2.urlopen(url).read()

urls = ['https://en.wikipedia.org/wiki/Donald_Trump',
        'https://en.wikipedia.org/wiki/Barack_Obama',
        'https://en.wikipedia.org/wiki/George_W._Bush',
        'https://en.wikipedia.org/wiki/Bill_Clinton',
        'https://en.wikipedia.org/wiki/George_H._W._Bush']

# Fetch the webpages in parallel.
results = ray.get([fetch.remote(url) for url in urls])

If you also want to process the webpages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose them together.

@ray.remote
def process(html):
    tokens = html.split()
    return set(tokens)

# Fetch and process the pages in parallel.
results = []
for url in urls:
    results.append(process.remote(fetch.remote(url)))
results = ray.get(results)

If you have a very long list of URLs that you want to fetch, you may wish to issue some tasks and then process them in the order that they complete. You can do this using ray.wait.

urls = 100 * urls  # Pretend we have a long list of URLs.
results = []

in_progress_ids = []

# Start pulling 10 URLs in parallel.
for _ in range(10):
    url = urls.pop()
    in_progress_ids.append(fetch.remote(url))

# Whenever one finishes, start fetching a new one.
while len(in_progress_ids) > 0:
    # Get a result that has finished.
    [ready_id], in_progress_ids = ray.wait(in_progress_ids)
    results.append(ray.get(ready_id))
    # Start a new task.
    if len(urls) > 0:
        in_progress_ids.append(fetch.remote(urls.pop()))

View the Ray documentation.

Willard answered 5/2, 2019 at 2:33 Comment(1)
Unfortunately, this doesn't work for Windows as support for Windows has not been released.Annetteannex
H
0

Fetching webpages obviously will take a while as you're not accessing anything local. If you have several to access, you could use the threading module to run a couple at once.

Here's a very crude example:

import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']
data1 = []
data2 = []

class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)
    def run(self):
        page = urllib2.urlopen(self.url)
        self.datadump.append(page.read()) # don't do it like this.

print "Starting threaded reads:"
start = time.clock()
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls): pass # don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i,x in enumerate(data1):
    print len(data1[i]), len(data2[i])

This was the output when I ran it:

Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483

Grabbing the data from the thread by appending to a list is probably ill-advised (Queue would be better) but it illustrates that there is a difference.
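A sketch of the Queue-based variant, written for Python 3 (queue and urllib.request rather than Queue and urllib2). The names fetch_threaded and read_url are illustrative. It joins the threads instead of spinning in a while loop, and it tags each result with its index so the pages come back in the same order as the URLs, which the list-append version does not guarantee.

```python
import queue
import threading
from urllib.request import urlopen


def read_url(url):
    """Fetch one page and return its body."""
    with urlopen(url, timeout=30) as page:
        return page.read()


def fetch_threaded(urls, fetch=read_url):
    """Fetch each URL in its own thread; return bodies in input order."""
    results = queue.Queue()

    def worker(index, url):
        try:
            results.put((index, fetch(url), None))
        except Exception as exc:  # hand failures back to the caller
            results.put((index, None, exc))

    threads = [threading.Thread(target=worker, args=(i, url))
               for i, url in enumerate(urls)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # block until finished, with no busy-wait loop

    pages = [None] * len(urls)
    while not results.empty():  # safe: every producer has finished
        index, body, error = results.get()
        if error is not None:
            raise error
        pages[index] = body
    return pages
```

One thread per URL is fine for a handful of pages; for long URL lists a fixed-size pool is the safer choice.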

Hanlon answered 16/8, 2010 at 2:8 Comment(4)
And may I ask why self.datadump.append(page.read()) # don't do it like this. is ill advised?Hydromedusa
-1 for suggesting threads. This is IO-bound; threads are useless here.Calysta
@Aaron Gallagher Why did it run over twice as fast using threads?Hanlon
I never denied that your code can execute in less time. The problem is that the means by which you achieve that produce unsustainable, overcomplicated code compared to using async IO.Calysta
