Fast internet crawler

I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links, and follow them recursively, but without visiting the same url twice. Basically, I want to avoid looping.

I already wrote a crawler in Python, but it's too slow; I'm not able to saturate a 100Mbit line with it. Top speed is ~40 urls/sec, and for some reason it's hard to get better results. It seems like a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but that was solvable. CPU isn't the bottleneck, btw.

So, what should I use to write a crawler that is as fast as possible, and what's the best solution to avoid looping while crawling?

EDIT: The solution was to combine the multiprocessing and threading modules. Spawn multiple processes with multiple threads per process for best effect: spawning multiple threads in a single process is not effective, and multiple processes with just one thread each consume too much memory.
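
Roughly, the layout looks like this (worker counts, seed urls, and the fetch helper below are illustrative placeholders, not my actual code):

    import threading
    from multiprocessing import Process, Queue
    from urllib.request import urlopen

    NUM_PROCESSES = 8      # illustrative
    THREADS_PER_PROC = 25  # illustrative

    def fetch(url_queue, result_queue):
        while True:
            url = url_queue.get()
            if url is None:                  # poison pill: stop this thread
                return
            try:
                body = urlopen(url, timeout=5).read()
                result_queue.put((url, len(body)))
            except Exception:
                pass                         # skip unreachable pages

    def worker(url_queue, result_queue):
        # Each process runs many I/O-bound threads; the GIL is released while
        # sockets block, so the threads overlap well inside one process.
        threads = [threading.Thread(target=fetch, args=(url_queue, result_queue))
                   for _ in range(THREADS_PER_PROC)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    if __name__ == "__main__":
        urls, results = Queue(), Queue()
        for seed in ["http://example.com/"]:             # placeholder seed list
            urls.put(seed)
        for _ in range(NUM_PROCESSES * THREADS_PER_PROC):
            urls.put(None)                               # one poison pill per thread
        procs = [Process(target=worker, args=(urls, results))
                 for _ in range(NUM_PROCESSES)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()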

Tartan answered 4/10, 2011 at 19:51 Comment(7)
You won't get better results using Python's multi-threading past a certain point because of the global interpreter lock. Also, I'll bet you won't be able to saturate a 100Mbit line without retrieving duplicates. Long story short, you're prematurely optimizing.Chouest
Have you checked out Scrapy? It's pretty awesome for this sort of thing: scrapy.orgTramroad
@Falmarri: please elaborate on why you think I won't be able to saturate. If a page is ~50KB on average, then I need to process ~200 urls/sec to saturate. Do you think that's a problem?Tartan
Can you have multiple crawlers run at the same time?Tramroad
@pbp: The problem is that in order to saturate you have to be crawling 100% of the time. But you have to do some processing on incoming data to determine if the links you're seeing are duplicates before you send out your crawler. I made this a comment because you can probably get something like 95% or even 99% saturation, but not 100%. But before we go into details, you should probably give us your actual numbers.Chouest
@Falmarri: that's false. Your reasoning could be used to prove that it's impossible to saturate a line of arbitrarily small bandwidth (substitute a 1Mbit line for the 100Mbit line).Tartan
Note that web crawling is not data mining. Data mining is a very statistics-heavy analysis method (see Wikipedia). This sounds to me like just a regular web spider.Sohn

It sounds like you have a design problem more than a language problem. Try looking into the multiprocessing module, rather than threads, for accessing more sites at the same time. Also, consider keeping some kind of table to store your previously visited sites (a database, maybe?).
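
For example, a minimal sketch of such a "previously visited" table using SQLite (file and table names are placeholders):

    import sqlite3

    conn = sqlite3.connect("seen_urls.db")   # placeholder file name
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def mark_seen(url):
        """Return True if the url is new (and record it), False if already visited."""
        try:
            with conn:   # commits on success, rolls back on error
                conn.execute("INSERT INTO seen (url) VALUES (?)", (url,))
            return True
        except sqlite3.IntegrityError:       # PRIMARY KEY rejects duplicates
            return False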

Balkan answered 4/10, 2011 at 19:57 Comment(2)
Using multiprocessing helped. With 100 processes, I'm getting ~20Mbit of traffic. The problem is memory load: each Python interpreter takes ~7MB of memory, which is a lot.Tartan
Well, you are most definitely using too many processes. Most parallel algorithms only spawn as many processes as the machine has cores; if you spend a lot of time waiting for websites, spawn a few more. Try to find the balance that produces the highest pages/second (and probably use Scrapy too, as others suggested).Balkan

Why not use something already tested for crawling, like Scrapy? I managed to reach almost 100 pages per second on a low-end VPS with limited RAM (about 400MB), while network speed was around 6-7 Mb/s (i.e. below 100Mbps).
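
For illustration, a bare-bones link-following spider in Scrapy looks roughly like this (the domain, settings, and names are placeholders, not the spider from my test):

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class FastSpider(CrawlSpider):
        name = "fast_spider"
        start_urls = ["http://example.com/"]   # placeholder seed
        # Follow every link; Scrapy's built-in dupefilter skips urls it has already seen.
        rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)
        # Raise concurrency so the network stays busy.
        custom_settings = {"CONCURRENT_REQUESTS": 100, "DOWNLOAD_DELAY": 0}

        def parse_page(self, response):
            yield {"url": response.url}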

Another improvement you can make is to use urllib3 (especially when crawling many pages from a single domain). Here's a brief comparison I did some time ago:

urllib benchmark
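
The gist of the improvement is connection re-use; the pattern looks roughly like this (pool sizes and timeouts are illustrative, not the values from my benchmark):

    import urllib3

    # One PoolManager keeps connections open and re-uses them per host.
    http = urllib3.PoolManager(num_pools=10, maxsize=50)

    def fetch(url):
        # Later requests to the same host skip the TCP/TLS handshake.
        resp = http.request("GET", url, timeout=urllib3.Timeout(connect=2.0, read=5.0))
        return resp.data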

UPDATE:

Scrapy now uses the Requests library, which in turn uses urllib3. That makes Scrapy the absolute go-to tool when it comes to scraping. Recent versions also support deploying projects, so scraping from a VPS is easier than ever.

Waldo answered 4/10, 2011 at 20:2 Comment(3)
100 pages for a specific (single) domain?Tartan
Yes, 100 pages for a single domain hosted in the same country as the crawler itself (Germany).Waldo
Scrapy has never used the Requests module because Scrapy is written on top of Twisted. While Scrapy is fine for most scraping tasks, it has little support for distributed crawling, making it somewhat useless for larger-scale crawls.Carlita

Around 2 years ago I developed a crawler that can download almost 250 urls per second. You could follow my steps:

  1. Optimize your use of file pointers. Try to keep the number of open file pointers to a minimum.
  2. Don't write your data out every time. Try to dump your data after accumulating around 5000 or 10000 urls.
  3. For robustness, you don't need a different configuration. Use a log file, and when you want to resume, just read the log file and restart your crawler where it left off.
  4. Distribute your web-crawler work into separate tasks and process them at intervals (a minimal sketch of these stages follows the list):

    a. downloader

    b. link extractor

    c. URLSeen

    d. ContentSeen
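
A minimal single-process sketch of those four stages (the storage choices and names are only illustrative):

    import hashlib
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Link extractor: collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    url_seen = set()       # URLSeen: urls already fetched
    content_seen = set()   # ContentSeen: hashes of bodies already processed

    def crawl_one(url):
        if url in url_seen:                          # URLSeen check
            return []
        url_seen.add(url)
        body = urlopen(url, timeout=5).read()        # downloader
        digest = hashlib.sha1(body).hexdigest()
        if digest in content_seen:                   # ContentSeen check
            return []
        content_seen.add(digest)
        collector = LinkCollector()                  # link extractor
        collector.feed(body.decode("utf-8", errors="ignore"))
        return collector.links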

Thorncombe answered 18/7, 2012 at 10:57 Comment(0)

I have written a simple multithreaded crawler. It is available on GitHub as Discovering Web Resources, and I've written a related article: Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can change the number of threads used via the NWORKERS class variable. Don't hesitate to ask further questions if you need extra help.

Ebro answered 25/10, 2012 at 17:59 Comment(0)

It's impossible to tell what your limitations are. Your problem is similar to the C10K problem: read up on it first, don't optimize straight away. Go for the low-hanging fruit: most probably you will get significant performance improvements by analyzing your application design. Don't start out massively multithreaded or massively multiprocessed.

I'd use Twisted to write the networking part; it can be very fast. In general, I/O on the machine has to be better than average: you have to write your data either to disk or to another machine, and not every notebook supports 10MByte/s of sustained database writes. Lastly, if you have an asymmetric internet connection, it might simply be that your upstream is saturated; ACK prioritization helps here (OpenBSD example).
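
As a rough sketch of the Twisted approach (the seed urls and helper names are placeholders, not a tuned crawler):

    from twisted.internet import defer, reactor
    from twisted.web.client import Agent, readBody

    agent = Agent(reactor)

    @defer.inlineCallbacks
    def fetch(url):
        # Non-blocking request: many of these can be in flight in one process.
        response = yield agent.request(b"GET", url.encode("ascii"))
        body = yield readBody(response)
        defer.returnValue((url, len(body)))

    def main():
        seeds = ["http://example.com/"]   # placeholder seed urls
        d = defer.DeferredList([fetch(u) for u in seeds], consumeErrors=True)
        d.addCallback(lambda results: reactor.stop())
        reactor.run()

    if __name__ == "__main__":
        main()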

Unni answered 4/10, 2011 at 19:59 Comment(0)
