Fastest way to download a thousand files using Python? [closed]

I need to download a thousand CSV files, each 20 KB - 350 KB in size. Here is my code so far:

I'm using urllib.request.urlretrieve. With it I download a thousand files, about 250 MB in total, and it takes over an hour.

So my question is:

How can I download a thousand CSV files in less than an hour?

Thank you!

Paling asked 7/12, 2013 at 12:17 Comment(5)
Are you breaking the Yahoo license agreement? Have you checked? If not, they might be throttling your connection to prevent you from doing this. – Androsterone
@joe I'm not downloading this from Yahoo, it is just example code. – Paling
Your file sizes and file count don't add up. A thousand files at 20 KB - 350 KB each means between 20 and 350 MB in total, not 5 MB. – Kudva
@LennartRegebro check my edit. – Paling
Why did you remove the code in your question? – Rizzio

Most likely the reason it takes so long is that it takes time to open a connection, make the request, get the file, and close the connection again.

A thousand files in an hour is 3.6 seconds per file, which is high, but the site you are downloading from may be slow.

The first thing to do is to use HTTP/1.1 keep-alive so one connection stays open for all the files instead of being re-established for each one. The easiest way to do that is to use the Requests library and download everything through a session.
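A minimal sketch of that, assuming the Requests package is installed; the URL list and output filenames below are placeholders, not the real links:

    import requests

    # Placeholder URL list -- substitute the real CSV links.
    urls = ["https://example.com/data/file{}.csv".format(i) for i in range(1000)]

    # A Session reuses the underlying TCP connection (HTTP keep-alive),
    # so the connection-setup cost is paid once instead of once per file.
    session = requests.Session()

    for i, url in enumerate(urls):
        response = session.get(url)
        response.raise_for_status()
        with open("file{}.csv".format(i), "wb") as f:
            f.write(response.content)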

If this isn't fast enough, then you need to do several parallel downloads with either multiprocessing or threads.
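If you go the threaded route, one possible shape (a sketch only, not tested against the actual site) is a thread pool where each worker thread holds its own keep-alive session:

    import concurrent.futures
    import threading

    import requests

    # Placeholder URL list -- substitute the real CSV links.
    urls = ["https://example.com/data/file{}.csv".format(i) for i in range(1000)]

    thread_local = threading.local()

    def get_session():
        # One Session per thread, so each thread keeps its own connection alive.
        if not hasattr(thread_local, "session"):
            thread_local.session = requests.Session()
        return thread_local.session

    def download(job):
        index, url = job
        response = get_session().get(url)
        response.raise_for_status()
        with open("file{}.csv".format(index), "wb") as f:
            f.write(response.content)

    # Ten parallel workers; tune the number to what the server tolerates.
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        list(executor.map(download, enumerate(urls)))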

Kudva answered 7/12, 2013 at 12:40 Comment(4)
OP's using Python 3.x and the links are for 2.x docs. – Thermaesthesia
And you need the bandwidth on both the server and client side, even though some say that's unlikely to be the issue. The server also needs to support keep-alive. – Shipowner
@DerekLitz: It's less than 0.5 Mb/s. Sure, the server could be overloaded or restricted, but with many small files, latency is going to be a significant factor here. – Kudva
@LennartRegebro Yeah, that is a good assumption :) – Shipowner

The issue is very unlikely to be bandwidth (connection speed), because almost any network connection can sustain that rate. The issue is latency: the time it takes to establish a connection and set up each transfer. I know nothing about Python, but I would suggest you split your list and run the requests in parallel on multiple threads or processes, since the task is almost certainly neither CPU-bound nor bandwidth-bound. In other words, fire off multiple requests in parallel so that many connection setups proceed at the same time and the time each one takes is masked behind the others, something along the lines of the sketch below.
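For instance, splitting the list into slices and giving each slice to its own thread might look roughly like this (a sketch only; the URL list is made up and the thread count is arbitrary):

    import threading
    import urllib.request

    # Made-up list of (url, local filename) pairs -- replace with the real ones.
    jobs = [("https://example.com/data/file{}.csv".format(i), "file{}.csv".format(i))
            for i in range(1000)]

    def download_chunk(chunk):
        # Each thread works through its own slice of the list serially.
        for url, filename in chunk:
            urllib.request.urlretrieve(url, filename)

    num_threads = 10
    # Split the job list into num_threads interleaved slices.
    chunks = [jobs[i::num_threads] for i in range(num_threads)]

    threads = [threading.Thread(target=download_chunk, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()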

By the way, if your thousand files amount to 5 MB, then they are around 5 kB each, rather than the 20 kB to 350 kB you say.

Sphenogram answered 7/12, 2013 at 12:42 Comment(0)

You should try using multiple threads or processes to download many files in parallel. Have a look at the multiprocessing module, and especially its worker pools.
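A worker-pool sketch along these lines (the URL list is a placeholder, and the pool size is arbitrary):

    import multiprocessing
    import urllib.request

    # Placeholder URL list -- swap in the real CSV links.
    urls = ["https://example.com/data/file{}.csv".format(i) for i in range(1000)]

    def download(job):
        index, url = job
        urllib.request.urlretrieve(url, "file{}.csv".format(index))

    if __name__ == "__main__":
        # A pool of worker processes downloads the files in parallel.
        with multiprocessing.Pool(processes=8) as pool:
            pool.map(download, list(enumerate(urls)))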

Retain answered 7/12, 2013 at 12:41 Comment(0)

You are probably not going to be able to beat that speed without either a) a faster internet connection, both for you and for the provider, or b) getting the provider to offer a zip or tar.gz archive of the files you need.

The other possibility would be to use a cloud service such as Amazon's to fetch the files to a cloud location, zip or compress them there, and then download the single archive to your local machine. As the cloud service sits on the internet backbone, it should see faster transfers than you do. The downside is that you may end up having to pay, depending on the service you use.
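If the provider can be convinced to offer a single archive, fetching and unpacking it is trivial by comparison. A sketch, where the archive URL is hypothetical:

    import tarfile
    import urllib.request

    # Hypothetical archive URL -- this only works if the provider actually offers one.
    archive_url = "https://example.com/data/all_files.tar.gz"

    # One download, one connection setup.
    urllib.request.urlretrieve(archive_url, "all_files.tar.gz")

    # Then extract locally.
    with tarfile.open("all_files.tar.gz", "r:gz") as archive:
        archive.extractall("csv_files")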

Clipfed answered 7/12, 2013 at 12:33 Comment(1)
Faster internet connection than 5 MB an hour? :-) I don't think that's the problem, unless he is actually on a 14.4k modem. – Kudva
