nutch 1.10 input path does not exist /linkdb/current
When I run nutch 1.10 with the following command (assuming that TestCrawl2 did not previously exist and needs to be created):

sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore2 urls/ TestCrawl2/ 20

I receive an error on indexing that claims:

Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/TestCrawl2/linkdb/current

The linkdb directory exists, but does not contain the 'current' directory. The directory is owned by root, so there should be no permissions issues. Because the process exited with an error, the linkdb directory contains .locked and ..locked.crc files. If I run the command again, these lock files cause it to exit in the same place. Delete the TestCrawl2 directory, rinse, repeat.
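As a workaround for the stale locks, the lock files can be removed by hand instead of deleting the whole crawl directory each time. This is only a sketch of that cleanup (it simulates the leftover state in a temp directory; the real paths would be under /opt/apache-nutch-1.10/TestCrawl2), not an official Nutch recovery procedure:

```shell
#!/bin/sh
# Simulate the state left behind by the failed run (for illustration only):
CRAWL_DIR=$(mktemp -d)/TestCrawl2
mkdir -p "$CRAWL_DIR/linkdb"
touch "$CRAWL_DIR/linkdb/.locked" "$CRAWL_DIR/linkdb/..locked.crc"

# Remove only the stale lock files so a re-run does not abort immediately:
rm -f "$CRAWL_DIR/linkdb/.locked" "$CRAWL_DIR/linkdb/..locked.crc"

ls -A "$CRAWL_DIR/linkdb"
```

Note this only clears the lock; if linkdb/current is genuinely missing, the underlying cause (see the answer below the question on the original page) still has to be fixed.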

Note that the nutch and solr installations themselves have previously run without problems in a TestCrawl instance; it's only now that I'm trying a new one that I'm having problems. Any suggestions on troubleshooting this issue?

Orphism answered 3/11, 2015 at 20:44 Comment(0)

Ok, it seems as though I have run into a version of this problem:

https://issues.apache.org/jira/browse/NUTCH-2041

This is a result of the crawl script not being aware of changes to ignore_external_links (the db.ignore.external.links property) in my nutch-site.xml file.
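For reference, this is what the setting looked like in my conf/nutch-site.xml (the property name shown is the standard Nutch one; the description text is paraphrased, not copied from the Nutch defaults):

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to hosts other than the
  page's own host are discarded, so no external links are followed.</description>
</property>
```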

I am trying to crawl several sites and was hoping to keep my life simple by ignoring external links and leaving regex-urlfilter.txt alone (just using +.)

Now it looks like I'll have to change ignore_external_links back to false and add a regex filter for each of my URLs. Hopefully a nutch 1.11 release will arrive soon; it looks like this is fixed there.
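Until then, the per-site filters in conf/regex-urlfilter.txt would look roughly like this, replacing the catch-all +. rule (the hosts below are placeholders for my actual crawl targets):

```
# accept only the crawl targets (example hosts; substitute your own)
+^https?://([a-z0-9-]+\.)*site-one\.example/
+^https?://([a-z0-9-]+\.)*site-two\.example/
# reject everything else
-.
```

The order matters: Nutch applies the first matching rule, so the final -. line only rejects URLs that no earlier +^… rule accepted.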

Orphism answered 9/11, 2015 at 21:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.