Nutch Questions
3
Solved
I need to access a Lucene index (created by crawling several webpages using Nutch), but it gives the following error:
java.io.FileNotFoundException: no segments* file found in org.apache.l...
3
I am trying to use ManifoldCF to crawl a website, more specifically a Google Site that has SAML authentication, and index the crawled data into Apache Solr. But as I crawl the URL, it gives me 302 re...
Congregation asked 8/8, 2016 at 14:7
2
Solved
I followed the article https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr, but I want to clarify whether I have understood the Nutch steps correctly.
1). Inject: In this ...
2
We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, hence cost, is a huge factor for us as our initial attempts have ended up costing us over $20k.
Is there a...
Norse asked 10/10, 2017 at 18:41
5
Solved
I'm new to search engines and web crawlers. Now I want to store all the original pages in a particular web site as html files, but with Apache Nutch I can only get the binary database files. How do...
Decentralize asked 4/4, 2012 at 8:6
3
I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given ...
4
I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an explanation of how to open it in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse...
Kimmie asked 12/3, 2013 at 9:27
1
I'm using Linux with Hadoop, Cloudera and HBase.
Could you tell me how to correct this error?
Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob
The following command ga...
0
I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands on Windows 8.
I have put the hadoop-core jar into the lib folder, but when I try to run a crawl I get:
Except...
Combs asked 12/5, 2016 at 8:48
1
I am trying to set up Apache Nutch to crawl URLs, following this guide. Since it is an older guide (it is for 1.x; I am using 2.3), I have made the necessary changes to the structure. However, when I ...
4
Solved
Has anyone had any luck writing custom indexers for Nutch to index the crawl results with Elasticsearch? Or do you know of any that already exist?
Marlomarlon asked 15/5, 2011 at 23:58
1
What is the maximum number of Apache Nutch crawler instances that can run at the same time with one master node?
1
Solved
When I run nutch 1.10 with the following command, assuming that TestCrawl2 did not previously exist and needs to be created,...
sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/T...
1
Solved
I am using Nutch 2.3. All jobs run one after another, i.e. first generator, fetch, parse, index, etc. I want to run some jobs simultaneously. I know some jobs cannot run in parallel, but others can, e.g. p...
Walkin asked 5/5, 2015 at 6:35
1
Solved
I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis.
I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrat...
Saltcellar asked 30/3, 2015 at 9:43
4
Is there any way to get the HTML content of each web page in Nutch while crawling it?
Perilymph asked 25/2, 2011 at 23:16
0
I am crawling some data from the web using Apache Nutch 2.3. My Solr version is 4.10.3. The data is crawled successfully into HBase and also indexed in Solr, but at the end (dedup stage) the following error appears i...
Strained asked 28/1, 2015 at 5:53
2
Solved
I am using a ZooKeeper ensemble for HBase. ZooKeeper is running on 3 machines, and HBase is also in fully distributed mode. I have Nutch 2.x. When I start Nutch to crawl some data, it gives...
Inexperience asked 23/1, 2015 at 12:13
2
I am trying to use Nutch 1.6 in a Windows environment, but every time I follow the procedure given in the Apache Nutch Tutorial I end up with the following exception:
Excep...
2
Solved
I have been running Nutch crawling commands for the past 3 weeks, and now I get the error below when I try to run any Nutch command:
Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space ...
5
I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is:
using Nutch as the web crawler,
using Solr as the search...
Saideman asked 24/11, 2010 at 17:24
7
Solved
I'm working on a crawler and need to understand exactly what is meant by "link depth". Take nutch for example: http://wiki.apache.org/nutch/NutchTutorial
depth indicates the link depth from the ...
Interpreter asked 4/12, 2010 at 23:54
1
I have started working with Nutch and Solr, and I have a problem integrating Solr with Nutch. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using:
bin/nutch cr...
Mckenziemckeon asked 17/11, 2012 at 9:56
1
Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR.
To do this I specify an input directory from S3. I get the following error:
Fetcher: java.lang.IllegalArgumentException:
This file syste...
Elect asked 30/8, 2011 at 1:52
1
Solved
Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink:
Nutch can run on a single machine, but gains a lot...
Dodge asked 9/3, 2014 at 14:47