Apache Nutch 2.1 different batch id (null)

I am crawling a few sites with Apache Nutch 2.1.

While crawling, I see the following message for a lot of pages, for example:
Skipping http://www.domainname.com/news/subcategory/111111/index.html; different batch id (null).

What causes this message?
How can I resolve it? The pages reported with "different batch id (null)" are not stored in the database.

The site I crawled is based on Drupal, but I see the same behavior on many other non-Drupal sites.

Lysias answered 12/2, 2013 at 8:33 Comment(3)
Have you been able to resolve this? - Unconsidered
No. I tried for several weeks without success, and after that I stopped using Nutch. As an alternative you can use a PHP crawler: link link - Lysias
I found a workaround that fits my needs. Python's Scrapy is great as well: scrapy.org - Unconsidered

I think the message itself is not a problem. A batch_id has not been assigned to every URL, so when a URL's batch_id is null, that URL is skipped. The URL is fetched once the generate step has assigned it a batch_id.
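
To make the skip rule above concrete, here is a rough sketch in Python. It is not Nutch's actual code: the names `should_process`, `fetch_batch`, the `ALL_BATCHES` sentinel, and the plain dict of URL to batch id are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical sentinel meaning "fetch every generated batch".
ALL_BATCHES = "-all"


def should_process(mark: Optional[str], batch_id: str) -> bool:
    """Return True if a page carrying `mark` belongs to the batch being fetched."""
    if mark is None:
        # The URL was never picked by a generate step, so it has no batch id.
        return False
    if batch_id == ALL_BATCHES:
        return True
    return mark == batch_id


def fetch_batch(pages: Dict[str, Optional[str]], batch_id: str) -> List[str]:
    """Return the URLs that would be fetched; log the ones that are skipped."""
    fetched = []
    for url, mark in pages.items():
        if not should_process(mark, batch_id):
            shown = "null" if mark is None else mark
            print(f"Skipping {url}; different batch id ({shown})")
            continue
        fetched.append(url)
    return fetched


if __name__ == "__main__":
    # Illustrative data only: one generated URL, one never generated.
    pages = {
        "http://example.com/a": "batch-1",
        "http://example.com/b": None,
    }
    print(fetch_batch(pages, "batch-1"))
```

In this sketch a URL with no batch id is only skipped for the current round; it becomes eligible as soon as a later generate step gives it one.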

Estes answered 18/4, 2013 at 9:37 Comment(0)
