nutch - 2 - McMap

5

Solved

Nutch in Windows: Failed to set permissions of path

I'm trying to use Solr with Nutch on a Windows machine and I'm getting the following error: Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\map...

windows solr hadoop cygwin nutch

Inheritor asked 3/3, 2013 at 16:53

5

Solved

How to produce massive amount of data?

I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, 500 GB and eventually reach 1-2 TB. The problem is that I don't have this...

java hadoop nutch bigdata

Bathsheeb asked 29/12, 2011 at 12:59

2

Solved

get out links from nutch

I am using nutch 1.3 to crawl a website. I want to get a list of urls crawled, and urls originating from a page. I get list of urls crawled using readdb command. bin/nutch readdb crawl/crawldb -d...

web-crawler nutch

Acceleration asked 15/9, 2011 at 2:13

1

How to parse content located in specific HTML tags using nutch plugin?

I am using Nutch to crawl websites and I want to parse specific sections of html pages crawled by Nutch. For example, <h><title> title to search </title></h> <div id="...

nutch

Probationer asked 31/7, 2013 at 14:02

1

Solved

Creating an Akka fat Jar

I need to create a Nutch plugin that communicates with some external applications using Akka. In order to do this, I need to package the plugin as a fat Jar - I am using sbt-assembly version 0.8.3. ...

scala sbt akka nutch sbt-assembly

Kurt asked 4/3, 2013 at 13:51

3

How to recrawl with Nutch

I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites, and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching. Now my problem is, wh...

nutch web-crawler

Enjoin asked 14/12, 2012 at 6:21

2

connection refused error when running Nutch 2

I am trying to run Nutch 2 crawler on my system but I get the following error: Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLTransientConnectionEx...

java web-crawler nutch

Exportation asked 25/9, 2012 at 10:53

1

Apache Nutch 2.1 different batch id (null)

I crawl a few sites with Apache Nutch 2.1. While crawling, I see the following message on a lot of pages, e.g. Skipping http://www.domainname.com/news/subcategory/111111/index.html; different batch id (...

apache nutch web-crawler

Lysias asked 12/2, 2013 at 8:33

1

Solved

Apache Nutch and Solr integration

I've tried to follow the Nutch tutorial but am having a bit of a problem with the schema.xml file. I was told to copy the Nutch-provided schema to my project, essentially this... cp ${NUTCH_RUNTIME_HOME}...

linux solr lucene nutch

Madura asked 11/4, 2013 at 10:02

3

Solved

Recrawl URL with Nutch just for updated sites

I crawled one URL with Nutch 2.1 and now I want to re-crawl its pages after they are updated. How can I do this? How can I know that a page has been updated?

apache solr lucene nutch web-crawler

Israelisraeli asked 10/1, 2013 at 15:40

1

How to extend Nutch for article crawling

I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's my plan, with questions on each step: 1. Add article list pages into url/seed.txt. Here's one problem. What I actually want to be ...

web-crawler nutch

Dinsmore asked 15/12, 2012 at 15:13

2

Solved

nutch vs solr indexing

I have recently started working on Nutch and I am trying to understand how it works. As far as I know, Nutch is basically used to crawl the web and Solr/Lucene is used to index and search. But when ...

solr lucene nutch

Adkisson asked 1/6, 2012 at 5:18

1

Solved

what is going on inside of Nutch 2?

I am eager to know (and have to know) about Nutch and the algorithms it uses to fetch, classify, ... (generally, crawling), because it relates to my project. I read this material but it's a littl...

algorithm analysis nutch infrastructure

Pelt asked 27/7, 2012 at 22:22

2

Solved

Using Nutch solrindex to index to multiple cores?

Is there a parameter in the bin/nutch solrindex command to indicate which Solr core to index to?

solr nutch

Mooney asked 1/5, 2012 at 7:37

1

Solved

Nutch-Cygwin How to set JAVA_HOME

I am trying to run Nutch with Cygwin. I am having problems setting JAVA_HOME. $ export JAVA_HOME='/cygdrive/f/program files/java/jdk1.6.0_21' When I run the nutch command $ bin/nutch crawl i...

cygwin nutch

Albaalbacete asked 19/2, 2012 at 0:47

2

Solved

Nutch No agents listed in 'http.agent.name'

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property. at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166) a...

web-crawler nutch

Softspoken asked 5/7, 2011 at 12:51

1

Solved

Nutch: Invoke in Java, not command line?

Am I being thick or is there really no way to invoke Apache Nutch through some Java code programmatically? Where is the documentation (or a guide or tutorial) on how to do this? Google has failed m...

java web-crawler nutch

Jordaens asked 24/3, 2011 at 14:50

2

Suggestion for building search engine using Django

I'm new to web crawling. I'm going to build a search engine whose crawler saves Rapidshare links, including the URL where each Rapidshare link was found... In other words, I'm going to build a website...

django search-engine nutch scrapy

Doubles asked 7/1, 2011 at 15:05

2

Re-crawling websites fast

I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new sites that have been added during the day). Content of this porta...

wget web-crawler nutch

Empirical asked 6/1, 2011 at 18:46

1

Solved

Nutch API advice

I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that is able to...

java web-crawler nutch

Plenitude asked 2/12, 2010 at 21:37

3

Solved

Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project I've begun. This project involves pulling down the full document content of all web pages in a single domain name that are reac...

scrapy web-crawler nutch

Autobiography asked 6/8, 2010 at 13:08

3

How is an aggregator built? [closed]

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that? Have a spider/crawler who will crawl ...

web-services aggregation web-crawler nutch

Redshank asked 29/5, 2009 at 22:36

1

Solved

Nutch versus Solr

I am currently collecting information on whether I should use Nutch with Solr (domain: vertical web search). Could you advise me?

solr nutch

Infidel asked 12/5, 2010 at 11:00

4

Solved

Does any open, easily extensible web crawler exist?

I am searching for a web crawler solution that is mature enough and can be easily extended. I am interested in the following features... or the possibility of extending the crawler to meet them: partly ju...

web-scraping web-crawler nutch

Discovert asked 18/1, 2010 at 10:11

10

Solved

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform ba...

lucene solr nutch

Directive asked 21/10, 2008 at 21:15
