Recrawl URL with Nutch just for updated sites

I crawled a URL with Nutch 2.1, and now I want to re-crawl pages after they have been updated. How can I do this? How can I know whether a page has been updated?

Israelisraeli asked 10/1, 2013 at 15:40 Comment(0)

Simply put, you can't. You need to re-crawl a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and re-crawl them periodically. For that you need a job scheduler such as Quartz.

You also need to write a function that compares the old and new versions of a page. However, Nutch normally saves pages into its index files; in other words, it writes the fetched HTML into binary files. Comparing those binary files directly is not practical, because Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format so you can compare them, see my answer to this question.
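
As a rough illustration of such a comparison (this is not part of the Nutch API; the file names and the crude tag-stripping are just placeholders), you could hash the visible text of two saved HTML snapshots and re-crawl only when the hash changes:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    /**
     * Minimal sketch: compares two saved HTML snapshots of the same URL by
     * hashing their visible text, so cosmetic markup changes are less likely
     * to trigger a re-crawl. The file names are hypothetical.
     */
    public class PageChangeCheck {

        /** Very rough tag stripping; a real HTML parser would be more robust. */
        static String extractText(String html) {
            return html.replaceAll("(?is)<script.*?</script>", " ")
                       .replaceAll("(?is)<style.*?</style>", " ")
                       .replaceAll("<[^>]+>", " ")
                       .replaceAll("\\s+", " ")
                       .trim();
        }

        /** MD5 hex digest of the extracted text. */
        static String signature(String html) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(extractText(html).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            String oldHtml = new String(Files.readAllBytes(Paths.get("old.html")), StandardCharsets.UTF_8);
            String newHtml = new String(Files.readAllBytes(Paths.get("new.html")), StandardCharsets.UTF_8);
            boolean changed = !signature(oldHtml).equals(signature(newHtml));
            System.out.println(changed ? "Page changed -> re-crawl" : "No change detected");
        }
    }

Hashing only the extracted text means purely cosmetic markup changes are less likely to trigger a re-crawl; anything smarter (ignoring ads, timestamps, etc.) would need a real HTML parser.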

Skippie answered 10/1, 2013 at 15:45 Comment(4)
How does the job scheduler compare crawls to tell whether a page has been updated or is the same? I mean, how does Nutch or Solr compare the content? - Dorthadorthea
So every page should be checked for changes against the old version, and if there is new content, the page will be re-crawled. If I understand correctly, I just need a simple function for this that compares strings? - Israelisraeli
That's correct. But you might be looking for a change in a specific area of the page; once you have the raw HTML, you can easily determine what to do. - Sherd
I disagree on that: Nutch provides the ability to detect pages that are new or updated, and it should be able to do this for you. - Knockout

You have to schedule a job to fire the crawl.
However, Nutch's AdaptiveFetchSchedule should enable you to crawl and index pages and detect whether a page is new or updated, so you don't have to do it manually.

This article describes the same in detail.
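
For reference, a minimal sketch of the relevant nutch-site.xml properties (the property and class names exist in Nutch; the interval values are illustrative only, not recommendations):

    <!-- nutch-site.xml (sketch): switch to the adaptive fetch schedule -->
    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>
    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value> <!-- initial re-fetch interval: about one day -->
    </property>
    <property>
      <name>db.fetch.schedule.adaptive.min_interval</name>
      <value>3600</value> <!-- never re-fetch more often than hourly -->
    </property>
    <property>
      <name>db.fetch.schedule.adaptive.max_interval</name>
      <value>604800</value> <!-- back off to at most weekly for unchanged pages -->
    </property>

With this in place, pages that keep changing are re-fetched more and more often, while pages that stay the same drift toward the maximum interval.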

Knockout answered 11/1, 2013 at 6:5 Comment(8)
OK, I read the article, and I have another question. Do I have to use a job scheduler to run my crawl command for the given URL, or do I need the adaptive fetch schedule to do this? And if AdaptiveFetchSchedule is the right one, how can I use it? - Israelisraeli
You can configure the adaptive schedule within the config. And you would need a scheduler to fire the job, e.g. Autosys, Quartz, etc. - Knockout
I will have to disagree with you here. The class you mention works according to the crawled site's "If-Modified-Since" and "Last-Modified" HTTP headers. And I must say, hardly any sites (except for Google, YouTube, Stack Overflow, etc.) can be trusted on the truthfulness of these headers. - Sherd
If you are building the site, it's up to you to take care of this so that crawling works well for you. - Knockout
I don't really understand you here. You mean you're crawling your own website, one you made yourself? Why? :) - Sherd
Why not? :) We have a huge number of intranet and news sites. We want to allow people to search through these sites, and we use Nutch incremental indexing because we cannot re-index all the content every time. Here we can control what indicates to Nutch when a page was updated. - Knockout
@IsmetAlkan: I think Jayendra is right. It is explained in the article. I don't think AdaptiveFetchSchedule relies only on the "If-Modified-Since" and "Last-Modified" HTTP headers. From the article: "Each time a page is fetched, Nutch computes a signature for the page. At the next fetch, if the signature is the same (or if a 304 is returned by the web server because of the If-Modified-Since header), Nutch can tell if the page was modified or not." - Courtneycourtrai
More importantly, the last part of this: "By default the signature of a page is built not only with its content, but also with the HTTP headers returned with the page. So even if the content of a page has not changed, if an HTTP header is not the same (like an ETag or a date), the signature changes. To solve that problem, there is the TextProfileSignature class. It is designed to look only at the text content of a page to build the signature." - Courtneycourtrai
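
Following up on the TextProfileSignature comment above: switching the signature implementation is a single-property change in nutch-site.xml; a sketch (the class and property names are taken from the Nutch codebase):

    <!-- nutch-site.xml (sketch): ignore HTTP header churn when computing page signatures -->
    <property>
      <name>db.signature.class</name>
      <value>org.apache.nutch.crawl.TextProfileSignature</value>
    </property>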

What about http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ ?

This is discussed in: How to re-crawl with Nutch

I am wondering whether the above-mentioned solution will indeed work. I am trying it as we speak. I crawl news sites, and they update their front page quite frequently, so I need to re-crawl the index/front page often and fetch the newly discovered links.
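
A rough sketch of such a periodic re-crawl loop; the exact bin/nutch arguments differ between Nutch versions (the -topN and -all flags below are assumed from Nutch 2.x usage), so check each command's usage output first:

    #!/bin/sh
    # Periodic re-crawl loop (sketch). Path and arguments are assumptions;
    # adjust them to your Nutch installation and version.
    NUTCH=/path/to/nutch/runtime/local/bin/nutch   # hypothetical install path

    while true; do
      "$NUTCH" generate -topN 1000   # select URLs whose fetch time is due
      "$NUTCH" fetch -all            # fetch the generated batch(es)
      "$NUTCH" parse -all            # parse what was fetched
      "$NUTCH" updatedb              # merge results and reschedule next fetch times
      sleep 3600                     # wait an hour before the next cycle
    done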

Breger answered 13/1, 2013 at 9:50 Comment(1)
What are you thinking, recommending the same article that was already recommended in a previous answer? - Sherd
