nutch Questions

5

Solved

I'm trying to use Solr with Nutch on a Windows machine and I'm getting the following error: Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\map...
Inheritor asked 3/3, 2013 at 16:53

5

Solved

I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, 500 GB and eventually reach 1-2 TB. The problem is that I don't have this...
Bathsheeb asked 29/12, 2011 at 12:59

2

Solved

I am using Nutch 1.3 to crawl a website. I want to get a list of the URLs crawled, and the URLs originating from a page. I get the list of URLs crawled using the readdb command. bin/nutch readdb crawl/crawldb -d...
Acceleration asked 15/9, 2011 at 2:13

1

I am using Nutch to crawl websites and I want to parse specific sections of html pages crawled by Nutch. For example, <h><title> title to search </title></h> <div id="...
Probationer asked 31/7, 2013 at 14:02

1

Solved

I need to create a Nutch plugin that communicates with some external applications using Akka. In order to do this, I need to package the plugin as a fat JAR - I am using sbt-assembly version 0.8.3. ...
Kurt asked 4/3, 2013 at 13:51

3

I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites, and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching. Now my problem is, wh...
Enjoin asked 14/12, 2012 at 6:21

2

I am trying to run the Nutch 2 crawler on my system but I get the following error: Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLTransientConnectionEx...
Exportation asked 25/9, 2012 at 10:53

1

I crawl a few sites with Apache Nutch 2.1. While crawling I see the following message on a lot of pages, e.g. Skipping http://www.domainname.com/news/subcategory/111111/index.html; different batch id (...
Lysias asked 12/2, 2013 at 8:33

1

Solved

I've tried to follow the Nutch tutorial but am having a bit of a problem with the schema.xml file. I was told to copy the Nutch-provided schema to my project, essentially this... cp ${NUTCH_RUNTIME_HOME}...
Madura asked 11/4, 2013 at 10:02

3

Solved

I crawled one URL with Nutch 2.1 and now I want to re-crawl its pages after they are updated. How can I do this? How can I know that a page has been updated?
Israelisraeli asked 10/1, 2013 at 15:40

1

I'm looking for a framework to grab articles, and I found Nutch 2.1. Here are my plan and my questions at each step: 1 Add article list pages into url/seed.txt Here's one problem. What I actually want to be ...
Dinsmore asked 15/12, 2012 at 15:13

2

Solved

I have recently started working on Nutch and I am trying to understand how it works. As far as I know, Nutch is basically used to crawl the web and Solr/Lucene is used to index and search. But when ...
Adkisson asked 1/6, 2012 at 5:18

1

Solved

I am eager to know (and have to know) about Nutch and the algorithms it uses to fetch, classify, ... (generally, crawling), because it relates to my project. I read this material but it's a littl...
Pelt asked 27/7, 2012 at 22:22

2

Solved

Is there a parameter in the bin/nutch solrindex command to indicate which Solr core to index to?
Mooney asked 1/5, 2012 at 7:37

1

Solved

I am trying to run Nutch with Cygwin. I am having problems setting JAVA_HOME. $ export JAVA_HOME='/cygdrive/f/program files/java/jdk1.6.0_21' When I run the nutch command $ bin/nutch crawl i...
Albaalbacete asked 19/2, 2012 at 0:47

2

Solved

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) a...
Softspoken asked 5/7, 2011 at 12:51

1

Solved

Am I being thick or is there really no way to invoke Apache Nutch through some Java code programmatically? Where is the documentation (or a guide or tutorial) on how to do this? Google has failed m...
Jordaens asked 24/3, 2011 at 14:50

2

I'm new to web crawling. I'm going to build a search engine whose crawler saves Rapidshare links, including the URL where each Rapidshare link was found... In other words, I'm going to build a website...
Doubles asked 7/1, 2011 at 15:05

2

I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new sites that have been added during the day). Content of this porta...
Empirical asked 6/1, 2011 at 18:46

1

Solved

I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that is able to...
Plenitude asked 2/12, 2010 at 21:37

3

Solved

For the past month I've been using Scrapy for a web crawling project I've begun. This project involves pulling down the full document content of all web pages in a single domain name that are reac...
Autobiography asked 6/8, 2010 at 13:08

3

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that? Have a spider/crawler who will crawl ...
Redshank asked 29/5, 2009 at 22:36

1

Solved

I am currently collecting information about where I should use Nutch with Solr (domain: vertical web search). Could you advise me?
Infidel asked 12/5, 2010 at 11:00

4

Solved

I am searching for a web crawler solution which is mature enough and can be simply extended. I am interested in the following features... or the possibility to extend the crawler to meet them: partly ju...
Discovert asked 18/1, 2010 at 10:11

10

Solved

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform ba...
Directive asked 21/10, 2008 at 21:15
