An alternative web crawler to Nutch [closed]

5

20

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is:

  • using Nutch as the web crawler,
  • using Solr as the search engine,
  • coding the front end and the site logic with Wicket.

The problem is that I find Nutch quite complex, and it's a big piece of software to customise, especially since detailed documentation (books, recent tutorials, etc.) simply does not exist.

Questions now:

  1. Any constructive criticism about the whole idea of the site?
  2. Is there a good yet simple alternative to Nutch (as the crawling part of the site)?

Thanks

Saideman answered 24/11, 2010 at 17:24 Comment(1)
For years we've tried everything: Nutch, Heritrix, Storm Crawler, crawler4j, our own in-house crawler... However, there's only one truly impressive alternative out there that our entire team swears by: Mixnode. - Renunciation
4

Scrapy is a Python library that crawls web sites. It is fairly small (compared to Nutch) and designed for limited site crawls. It has a Django-like MVC style that I found pretty easy to customize.

Workaday answered 24/11, 2010 at 17:57 Comment(0)
4

For the crawling part, I really like anemone and crawler4j. They both allow you to add your own custom logic for link selection and page handling. For each page that you decide to keep, you can easily add the call to Solr.
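As a rough illustration of "adding the call to Solr" for each kept page, here is a minimal sketch in Python using only the standard library. It assumes a Solr instance whose update handler accepts a JSON array of documents, and a schema with `id`, `title` and `content` fields; the base URL, the core name and the helper names `build_solr_update` and `index_page` are hypothetical, not something anemone or crawler4j provide.

```python
import json
from urllib.request import Request, urlopen


def build_solr_update(solr_url, core, document):
    """Build a POST request that sends one document to Solr's update handler."""
    body = json.dumps([document]).encode("utf-8")
    return Request(
        f"{solr_url}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_page(solr_url, core, url, title, text):
    """Send a crawled page to Solr and return the HTTP status code."""
    # Field names are assumptions; they must match your Solr schema.
    document = {"id": url, "title": title, "content": text}
    request = build_solr_update(solr_url, core, document)
    with urlopen(request, timeout=10) as response:
        return response.status
```

Committing on every request keeps the sketch short; for anything beyond a handful of pages you would probably batch documents and commit less often.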

Crotchety answered 27/2, 2011 at 14:35 Comment(0)
4

It depends on how many web sites, and therefore how many URLs, you plan to crawl. Apache Nutch stores page documents in Apache HBase (which relies on Apache Hadoop); it's solid, but very hard to set up and administer.

Since a crawler just fetches a page (like cURL) and retrieves the list of links to feed your URL database, I am sure you can write a crawler on your own (especially if you only have a few web sites), using a simple MySQL database (and maybe queue software like RabbitMQ to schedule the crawl jobs).
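To make the "fetch a page, collect its links, feed the URL queue" loop concrete, here is a minimal sketch in Python with only the standard library. An in-memory `deque` and `set` stand in for the MySQL table and the RabbitMQ queue, and all names (`LinkParser`, `extract_links`, `fetch`, `crawl`) are made up for the example. It ignores robots.txt, rate limiting and retries, which a real crawler would need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute links found in html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.hrefs]


def fetch(url):
    """Download a page over HTTP (the 'cURL' step)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, allowed_hosts, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl limited to allowed_hosts; returns {url: html}."""
    queue = deque(seeds)   # stand-in for a job queue
    seen = set(seeds)      # stand-in for the URL database
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(url, html):
            if urlparse(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Something like `crawl(["http://example.com/"], {"example.com"})` would then return the fetched pages keyed by URL, ready to be cleaned up and indexed.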

On the other hand, a crawler can be more sophisticated: you might want to strip the HEAD part from your HTML document and keep only the real "content" of the page, etc.
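A minimal sketch of that kind of clean-up, in Python with only the standard library: it drops everything inside `<head>`, `<script>` and `<style>` and keeps the remaining visible text. The names `ContentExtractor` and `extract_content` are hypothetical, and real "main content" extraction (removing menus, footers, ads) takes considerably more than this.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Keeps text nodes that are not inside <head>, <script> or <style>."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_content(html):
    """Return the visible text of an HTML document as one string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```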

Also, Nutch can rank your pages with a PageRank algorithm; you could use Apache Spark to do the same thing (more efficiently, because Spark can cache data in memory).
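For a small set of sites, the ranking idea can be sketched without Nutch or Spark at all. The following is a plain-Python power iteration of PageRank over a link graph. It is not Nutch's or Spark's implementation; the function name `pagerank`, the damping factor of 0.85 and the fixed iteration count are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores.

    links maps each page to the list of pages it links to.
    Returns a dict {page: score} whose scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks
```

The `links` dictionary could be built from the crawl itself, mapping each fetched URL to the links extracted from its page.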

Ratfink answered 2/5, 2014 at 8:38 Comment(0)
2

It's in C#, but a lot simpler, and you can communicate directly with the author (me).

I used to use Nutch and you are correct; it is a bear to work with.

http://arachnode.net

Vassaux answered 3/3, 2013 at 20:33 Comment(1)
I tried it and it's not simpler. - Festivity
0

I do believe Nutch is the best choice for your application, but if you want, there is a simple tool: Heritrix. Besides that, I recommend JavaScript for the front end, because Solr returns JSON, which is easily handled by JavaScript.

Bagasse answered 13/8, 2014 at 7:0 Comment(0)
