nutch - McMap

3

Solved

I need to access a Lucene index (created by crawling several webpages using Nutch), but it gives the error shown above: java.io.FileNotFoundException: no segments* file found in org.apache.l...

java lucene nutch

Disequilibrium asked 27/9, 2010 at 8:6

3

How to crawl a website that has SAML authentication using ManifoldCF or Nutch?

I am trying to crawl a website that has SAML authentication, more specifically a Google Site, using ManifoldCF, and index the crawled data into Apache Solr. But as I crawl the URL, it gives me 302 re...

solr saml nutch full-text-indexing manifoldcf

Congregation asked 8/8, 2016 at 14:7

2

Solved

Apache Nutch steps explanation

I followed the article https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr, but I want to check whether I have understood the Nutch steps correctly. 1) Inject: In this ...

apache nutch

Leesaleese asked 12/4, 2015 at 12:21

2

Nutch vs Heritrix vs Stormcrawler vs MegaIndex vs Mixnode [closed]

We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, hence cost, is a huge factor for us as our initial attempts have ended up costing us over $20k. Is there a...

web-crawler nutch heritrix stormcrawler

Norse asked 10/10, 2017 at 18:41

5

Solved

How do I save the original HTML file with Apache Nutch

I'm new to search engines and web crawlers. Now I want to store all the original pages in a particular web site as html files, but with Apache Nutch I can only get the binary database files. How do...

search-engine web-crawler nutch

Decentralize asked 4/4, 2012 at 8:6

3

Solr indexing following a Nutch crawl fails, reports "Job Failed"

I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given ...

solr nutch

Wrinkly asked 7/2, 2014 at 0:40

4

How to open an Ant project (Nutch source) in IntelliJ IDEA?

I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an explanation of how to open it in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse...

ant intellij-idea nutch

Kimmie asked 12/3, 2013 at 9:27

1

Could not find or load main class org.apache.nutch.crawl.InjectorJob

I'm using Linux with Hadoop, Cloudera and HBase. Could you tell me how to correct this error? Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob The following command ga...

hadoop solr nutch

Threw asked 9/3, 2015 at 9:27

0

Nutch problems executing crawl on Windows

I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands in Windows 8. I have put the hadoop-core jar into the lib folder, but when I try to run a crawl I obtain: Except...

windows web-crawler nutch

Combs asked 12/5, 2016 at 8:48

1

Apache Nutch - Problems with Paths

I am trying to set up Apache Nutch to crawl URLs, following this guide. Being an older guide (The guide is for 1.x, I am using 2.3), I have made the necessary changes to structure. However, when I ...

java apache nutch

Reachmedown asked 15/11, 2015 at 8:50

4

Solved

Have you indexed nutch crawl results using elasticsearch before?

Has anyone had any luck writing custom indexers for nutch to index the crawl results with elasticsearch? Or do you know of any that already exist?

lucene full-text-search web-crawler nutch elasticsearch

Marlomarlon asked 15/5, 2011 at 23:58

1

Maximum number of Apache Nutch worker instances

What is the maximum number of Apache Nutch crawler instances that can run at the same time with one master node?

hadoop nutch

Underhand asked 17/12, 2015 at 2:39

1

Solved

nutch 1.10 input path does not exist /linkdb/current

When I run nutch 1.10 with the following command, assuming that TestCrawl2 did not previously exist and needs to be created,... sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/T...

hadoop solr nutch

Orphism asked 3/11, 2015 at 20:44

1

Solved

How to run different Apache Nutch jobs in parallel

I am using Nutch 2.3. All jobs run one after another, i.e. first generator, fetch, parse, index, etc. I want to run some jobs simultaneously. I know some jobs cannot run in parallel, but others can, e.g. p...

java apache web-crawler nutch

Walkin asked 5/5, 2015 at 6:35

1

Solved

Where is the crawled data stored when running nutch crawler?

I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis. I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrat...

web-crawler nutch

Saltcellar asked 30/3, 2015 at 9:43

4

How to get the html content from nutch

Is there any way to get the HTML content of each webpage in Nutch while crawling it?

nutch

Perilymph asked 25/2, 2011 at 23:16

0

Solr dedup error Failed with exit value 255

I am crawling some data from the web using Apache Nutch 2.3. My Solr version is 4.10.3. The data is crawled successfully into HBase and also indexed in Solr, but at the end (dedup stage) the following error appears i...

java apache solr web-crawler nutch

Strained asked 28/1, 2015 at 5:53

2

Solved

zookeeper unable to open socket to localhost/0:0:0:0:0:0:0:1:2181

I am using a ZooKeeper ensemble for HBase. ZooKeeper is running on 3 machines, and HBase is also in fully distributed mode. I have Nutch 2.x. When I start Nutch to crawl some data, it gives...

apache hbase nutch apache-zookeeper

Inexperience asked 23/1, 2015 at 12:13

2

Using Nutch in Windows 7

I am trying to use Nutch 1.6 in a Windows environment, but every time I try to run it following the procedure given in the Apache Nutch Tutorial, I end up with the following exception: Excep...

windows windows-7 cygwin nutch

Clock asked 24/12, 2012 at 7:3

2

Solved

Insufficient space for shared memory file when I try to run nutch generate command

I have been running Nutch crawling commands for the past 3 weeks, and now I get the error below when I try to run any Nutch command: Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space ...

java jvm nutch

Victoir asked 12/1, 2013 at 5:19

5

An alternative web crawler to Nutch [closed]

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is: using Nutch as the web crawler, using Solr as the search...

search-engine web-crawler nutch

Saideman asked 24/11, 2010 at 17:24

7

Solved

Web Crawler Algorithm: depth?

I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch for example: http://wiki.apache.org/nutch/NutchTutorial depth indicates the link depth from the ...

algorithm web-crawler nutch

Interpreter asked 4/12, 2010 at 23:54

1

Error while indexing in solr data crawled by nutch

I have started working with Nutch and Solr and I have a problem integrating Solr with Nutch. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using: bin/nutch cr...

solr indexing runtime-error nutch

Mckenziemckeon asked 17/11, 2012 at 9:56

1

Nutch on EMR problem reading from S3

Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR. To do this I specify an input directory from S3. I get the following error: Fetcher: java.lang.IllegalArgumentException: This file syste...

java hadoop amazon-web-services nutch

Elect asked 30/8, 2011 at 1:52

1

Solved

Apache Nutch: Get outlink URL's text context

Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink: Nutch can run on a single machine, but gains a lot...

apache hadoop web-scraping nutch

Dodge asked 9/3, 2014 at 14:47
