Is there a way to run cpython on a diffident thread without risking a crash?

I have a program that runs lots of urllib requests IN AN INFINITE LOOP, which makes my program really slow, so I tried running them as threads. Urllib uses CPython deep down in the socket module, so the threads that are being created just add up and do nothing, because Python's GIL prevents two CPython commands from being executed in diffident threads at the same time. I am running Windows XP with Python 2.5, so I can't use the multiprocessing module. I tried looking at the subprocess module to see if there was a way to execute Python code in a subprocess somehow, but found nothing. If anyone has a way that I can create a Python subprocess through a function like in multiprocessing, that would be great.
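For illustration, this is roughly the kind of thing I was hoping the subprocess module could do for me (just a made-up sketch of the idea, not code from my program; the inline script and URL are placeholders):

    import subprocess
    import sys

    # hypothetical: start a second Python interpreter so the work is not
    # limited by this process's GIL, and read its result back from stdout
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import urllib; print urllib.urlopen('http://example.com/').read()[:20]"],
        stdout=subprocess.PIPE)
    result = child.communicate()[0]
    print result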

Also, I would rather not download an external module, but I am willing to.

EDIT: Here is a sample of some code in my current program.

    url = "http://example.com/upload_image.php?username=Test&password=test"
    url = urllib.urlopen(url, data=urllib.urlencode({"Image": raw_image_data})).read()
    if url.strip().replace("\n", "") != "":
        print url

I did a test, and it turns out that urllib2's urlopen, with the Request object and without, is still as slow or slower. I created my own custom timeit-like module, and the above takes around 0.5-2 seconds, which is horrible for what my program does.

Brahui answered 1/9, 2012 at 14:55 Comment(4)
Is your whole program written in CPython? Would you mind going into a bit more detail about what you want to achieve, what you are doing, and where the problem is?Alienate
Guido rejected the idea of removing the GIL because it would change the implementation too much, so I don't think you'll be able to "remove the GIL". That would require a complete rewrite of most of the CPython code. A simple solution could be to use processes instead of threads: launch 4-5 processes for the requests, save the results to files, and then use the files from a "main process".Vouchsafe
I like the idea of a diffident thread. Perhaps a timid or shy thread that isn't really confident enough to execute code.Nonaggression
Use non-blocking IO instead of threads. There are many answers here explaining why non-blocking IO is better than threads in your scenario. You could start with this.Collarbone
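(A rough sketch of the non-blocking approach the comment above suggests, using asyncore from the standard library; the host and paths are placeholders, not anything from the question:)

    import asyncore
    import socket

    class HTTPGet(asyncore.dispatcher):
        """Issue one non-blocking HTTP GET and print the start of the reply."""
        def __init__(self, host, path):
            asyncore.dispatcher.__init__(self)
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.connect((host, 80))
            self.request = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)
            self.received = []

        def handle_connect(self):
            pass

        def writable(self):
            return bool(self.request)   # only ask to write while the request is unsent

        def handle_write(self):
            sent = self.send(self.request)
            self.request = self.request[sent:]

        def handle_read(self):
            self.received.append(self.recv(8192))

        def handle_close(self):
            self.close()
            print ''.join(self.received)[:60]

    # several requests share a single thread; select() multiplexes the sockets
    for path in ('/', '/a', '/b'):
        HTTPGet('example.com', path)
    asyncore.loop()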

Urllib uses CPython deep down in the socket module, so the threads that are being created just add up and do nothing, because Python's GIL prevents two CPython commands from being executed in diffident threads at the same time.

Wrong, though it is a common misconception. CPython can and does release the GIL during I/O operations (look at all the Py_BEGIN_ALLOW_THREADS calls in socketmodule.c). While one thread waits for I/O to complete, other threads can do some work. If the urllib calls are the bottleneck in your script, then threads may be one of the acceptable solutions.
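A quick way to see this for yourself (a rough sketch, not part of the original answer; the URL is a placeholder): two downloads running in parallel threads should take roughly as long as one, not twice as long, because each thread releases the GIL while it waits on the socket.

    import time
    import urllib2
    from threading import Thread

    URL = 'http://example.com/'  # placeholder; any reasonably slow URL will do

    def fetch():
        urllib2.urlopen(URL).read()

    start = time.time()
    fetch()
    print 'one request, sequential: %.2fs' % (time.time() - start)

    start = time.time()
    threads = [Thread(target=fetch) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print 'two requests, threaded:  %.2fs' % (time.time() - start)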

I am running Windows XP with Python 2.5, so I can't use the multiprocessing module.

You could install Python 2.6 or newer, or, if you must stay with Python 2.5, you could install the multiprocessing backport separately.

I created my own custom timeit-like module, and the above takes around 0.5-2 seconds, which is horrible for what my program does.

The performance of urllib2.urlopen('http://example.com...').read() depends mostly on outside factors such as DNS, network latency and bandwidth, and the performance of the example.com server itself.
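To get a feel for where that time goes, you could time the pieces separately, e.g. a sketch like this (the host name is a placeholder):

    import socket
    import time
    import urllib2

    host = 'example.com'  # placeholder

    start = time.time()
    socket.gethostbyname(host)                    # DNS lookup only
    print 'DNS lookup:   %.3fs' % (time.time() - start)

    start = time.time()
    urllib2.urlopen('http://%s/' % host).read()   # full request
    print 'full request: %.3fs' % (time.time() - start)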

Here's an example script which uses both threading and urllib2:

import urllib2
from Queue import Queue
from threading import Thread

def check(queue):
    """Check /n url."""
    opener = urllib2.build_opener() # if you use install_opener in other threads
    for n in iter(queue.get, None):
        try:
            data = opener.open('http://localhost:8888/%d' % (n,)).read()
        except IOError, e:
            print("error /%d reason %s" % (n, e))
        else:
            "check data here"

def main():
    nurls, nthreads = 10000, 10

    # spawn threads
    queue = Queue()
    threads = [Thread(target=check, args=(queue,)) for _ in xrange(nthreads)]
    for t in threads:
        t.daemon = True # die if program exits
        t.start()

    # provide some work
    for n in xrange(nurls): queue.put_nowait(n)
    # signal the end
    for _ in threads: queue.put(None)
    # wait for completion
    for t in threads: t.join()

if __name__ == "__main__":
    main()

To convert it to a multiprocessing script, just change the imports and your program will use multiple processes (note that the if __name__ == "__main__" guard above is required by multiprocessing on Windows):

from multiprocessing import Queue
from multiprocessing import Process as Thread

# the rest of the script is the same
Stoker answered 2/9, 2012 at 23:53 Comment(2)
My program has an infinite loop that continues until the program is "Control-C ed" or escaped in some other way. I tried this, and it seems to only work when the threads are joined afterwards, but that would be just as bad because it waits for the threads, right?Brahui
@user1474837: no. queue.put(None) signals that the threads should exit. The program exits when all items are processed.Stoker
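For the infinite-loop case there is no end to signal and nothing to join; a rough sketch of the same worker pattern fed forever (the numbered work items are placeholders for whatever your loop produces):

    import itertools
    from Queue import Queue
    from threading import Thread

    def worker(queue):
        while True:              # workers never exit on their own
            n = queue.get()
            # ... do the urllib2 request for item n here ...
            queue.task_done()

    queue = Queue(maxsize=100)   # bounded, so the producer can't run far ahead
    threads = [Thread(target=worker, args=(queue,)) for _ in range(10)]
    for t in threads:
        t.daemon = True          # daemon threads die when the main thread exits (e.g. on Ctrl-C)
        t.start()

    for n in itertools.count():  # the "infinite loop": keep feeding work forever
        queue.put(n)             # blocks while the queue is full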

If you want multithreading, Jython could be an option, as it doesn't have a GIL.

I concur with @Jan-Philip and @Piotr. What are you using urllib for?

Unexceptionable answered 1/9, 2012 at 22:9 Comment(0)
