Nutch Questions

3

Solved

I need to access a Lucene index (created by crawling several web pages with Nutch), but it gives the error shown above: java.io.FileNotFoundException: no segments* file found in org.apache.l...
Disequilibrium asked 27/9, 2010 at 8:6

3

I am trying to crawl a website, more specifically a Google Site using ManifoldCF that has SAML authentication and index the crawled data into Apache Solr. But as I crawl the URL, it gives me 302 re...
Congregation asked 8/8, 2016 at 14:7

2

Solved

I followed the article https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr. But I want to check whether I have understood the Nutch steps correctly. 1). Inject: In this ...
Leesaleese asked 12/4, 2015 at 12:21

2

We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, hence cost, is a huge factor for us as our initial attempts have ended up costing us over $20k. Is there a...
Norse asked 10/10, 2017 at 18:41

5

Solved

I'm new to search engines and web crawlers. Now I want to store all the original pages in a particular web site as html files, but with Apache Nutch I can only get the binary database files. How do...
Decentralize asked 4/4, 2012 at 8:6

3

I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given ...
Wrinkly asked 7/2, 2014 at 0:40

4

I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an explanation of how to open it in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse...
Kimmie asked 12/3, 2013 at 9:27

1

I'm using Linux with Hadoop, Cloudera and HBase. Could you tell me how to correct this error? Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob The following command ga...
Threw asked 9/3, 2015 at 9:27

0

I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands on Windows 8. I have put the hadoop-core jar into the lib folder, but when I try to run a crawl I get: Except...
Combs asked 12/5, 2016 at 8:48

1

I am trying to set up Apache Nutch to crawl URLs, following this guide. Since it is an older guide (it is for 1.x, I am using 2.3), I have made the necessary changes to the structure. However, when I ...
Reachmedown asked 15/11, 2015 at 8:50

4

Solved

Has anyone had any luck writing custom indexers for Nutch to index the crawl results with Elasticsearch? Or do you know of any that already exist?
Marlomarlon asked 15/5, 2011 at 23:58

1

What is the maximum number of Apache Nutch crawler instances that can run at the same time with one master node?
Underhand asked 17/12, 2015 at 2:39

1

Solved

When I run nutch 1.10 with the following command, assuming that TestCrawl2 did not previously exist and needs to be created,... sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/T...
Orphism asked 3/11, 2015 at 20:44

1

Solved

I am using Nutch 2.3. All jobs run one after another, i.e. first generator, fetch, parse, index, etc. I want to run some jobs simultaneously. I know some jobs cannot run in parallel, but others can, e.g. p...
Walkin asked 5/5, 2015 at 6:35

1

Solved

I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis. I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrat...
Saltcellar asked 30/3, 2015 at 9:43

4

Is there any way to get the HTML content of each web page in Nutch while crawling it?
Perilymph asked 25/2, 2011 at 23:16

0

I am crawling some data from the web using Apache Nutch 2.3. My Solr version is 4.10.3. The data is crawled successfully into HBase and also indexed in Solr, but at the end (dedup stage) the following error appears i...
Strained asked 28/1, 2015 at 5:53

2

Solved

I am using a ZooKeeper ensemble for HBase. ZooKeeper is running on 3 machines, and HBase is also in fully distributed mode. I have Nutch 2.x. When I start Nutch to crawl some data, it gives...
Inexperience asked 23/1, 2015 at 12:13

2

I am trying to use Nutch 1.6 in a Windows environment, but every time I try to run it following the procedure given on the Apache Nutch Tutorial site, I end up with the following exception: Excep...
Clock asked 24/12, 2012 at 7:3

2

Solved

I have been running Nutch crawling commands for the past 3 weeks, and now I get the error below when I try to run any Nutch command: Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space ...
Victoir asked 12/1, 2013 at 5:19

5

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is: using Nutch as the web crawler, using Solr as the search...
Saideman asked 24/11, 2010 at 17:24

7

Solved

I'm working on a crawler and need to understand exactly what is meant by "link depth". Take nutch for example: http://wiki.apache.org/nutch/NutchTutorial depth indicates the link depth from the ...
Interpreter asked 4/12, 2010 at 23:54

1

I have started working with Nutch and Solr, and I have a problem integrating Solr with Nutch. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using: bin/nutch cr...
Mckenziemckeon asked 17/11, 2012 at 9:56

1

Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR. To do this I specify an input directory from S3. I get the following error: Fetcher: java.lang.IllegalArgumentException: This file syste...
Elect asked 30/8, 2011 at 1:52

1

Solved

Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink: Nutch can run on a single machine, but gains a lot...
Dodge asked 9/3, 2014 at 14:47
