Does any open, easily extendible web crawler exist?

I am looking for a web crawler solution which is mature enough and can be simply extended. I am interested in the following features, or in the possibility of extending the crawler to meet them:

  • partly, just reading the feeds of several sites
  • scraping the content of those sites
  • if a site has an archive, I would like to crawl and index it as well
  • the crawler should be capable of exploring part of the Web for me, and it should be able to decide which sites match the given criteria
  • it should be able to notify me if it finds things that possibly match my interests
  • the crawler should not kill servers by hitting them with too many requests; it should crawl politely (see the rough sketch at the end of this question)
  • the crawler should be robust against freak sites and servers

Those things above can be done one by one without any big effort, but I am interested in any solution which provides a customisable, extendible crawler. I have heard of Apache Nutch, but I am very unsure about the project so far. Do you have experience with it? Can you recommend alternatives?
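To make the politeness and robustness points concrete, here is a rough sketch in plain Ruby (the feed URLs and the two-second delay are made up for illustration) of the kind of rate-limited, failure-tolerant fetch loop I have in mind:

    require 'net/http'
    require 'uri'

    # Hypothetical feed URLs; real ones would come from configuration.
    feeds = [
      'http://example.com/feed.rss',
      'http://example.org/news/atom.xml'
    ]

    feeds.each do |url|
      begin
        response = Net::HTTP.get_response(URI.parse(url))
        puts "#{url} -> #{response.code} (#{response.body.to_s.length} bytes)"
      rescue StandardError => e
        # Robustness against freak servers: log the failure and move on.
        puts "#{url} failed: #{e.message}"
      end
      sleep 2 # crude politeness: pause between requests instead of hammering hosts
    end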

Discovert answered 18/1, 2010 at 10:11 Comment(0)

A quick search on GitHub threw up Anemone, a web spider framework which seems to fit your requirements - particularly extensibility. Written in Ruby.
Hope it goes well!
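To give a flavour of its DSL, here is a minimal sketch; the start URL is a placeholder, and the option names (:delay, :depth_limit) should be checked against the README of the Anemone version you install:

    require 'anemone'

    # Minimal Anemone sketch: crawl a site with a delay between requests
    # and print each page's URL and title. The start URL is a placeholder.
    Anemone.crawl('http://example.com/', :delay => 1, :depth_limit => 3) do |anemone|
      anemone.on_every_page do |page|
        title = page.doc && page.doc.at('title')
        puts "#{page.url} #{title && title.text}"
      end
    end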

Isomerize answered 18/1, 2010 at 21:24 Comment(1)
Seems to be good stuff. I like that it is Ruby, and the author created a nice DSL for crawlers. Compared to Nutch I still don't see RSS feed support or things like PDF crawling, but it is extendible. Thanks for sharing the reference to Anemone. – Discovert

I've used Nutch extensively, when I was building the open source project index for my Krugle startup. It's hard to customize, being a fairly monolithic design. There is a plug-in architecture, but the interaction between plug-ins and the system is tricky and fragile.
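Purely to illustrate what a plug-in style extension point looks like (a made-up Ruby sketch, not Nutch's or Bixo's actual API): the core fetch loop hands every downloaded page to whichever registered handlers claim it, so extending the crawler means adding handlers rather than touching the core.

    # Hypothetical handler registry; every name here is invented for illustration.
    class FeedHandler
      def handles?(page)
        page[:content_type].include?('xml')
      end

      def process(page)
        puts "indexing feed: #{page[:url]}"
      end
    end

    class HtmlHandler
      def handles?(page)
        page[:content_type].include?('html')
      end

      def process(page)
        puts "scraping page: #{page[:url]}"
      end
    end

    handlers = [FeedHandler.new, HtmlHandler.new]

    # A fetched page would normally come from the crawler core; faked here.
    page = { :url => 'http://example.com/feed.rss', :content_type => 'application/xml' }
    handlers.select { |h| h.handles?(page) }.each { |h| h.process(page) }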

As a result of that experience, and needing something with more flexibility, I started the Bixo project - a web mining toolkit. http://openbixo.org.

Whether it's right for you depends on the weighting of factors such as:

  1. How much flexibility you need (+)
  2. How mature it should be (-)
  3. Whether you need the ability to scale (+)
  4. If you're comfortable with Java/Hadoop (+)
Phonography answered 31/1, 2010 at 15:47 Comment(0)

I heartily recommend Heritrix. It is VERY flexible, and I'd argue it is the most battle-tested freely available open source crawler, as it's the one the Internet Archive uses.

Deka answered 18/1, 2010 at 10:32 Comment(0)

You should be able to find something that fits your needs here.

Tutankhamen answered 18/1, 2010 at 11:3 Comment(2)
Are these things only created in Java? – Kimono
The article is titled, "Open Source Web Crawlers Written in Java". However, you can find web crawlers built in other languages that may provide you with what you need. – Rubetta