Solr indexing following a Nutch crawl fails, reports "Job Failed"
Asked Answered
W

3

8

I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given on the Nutch site (http://wiki.apache.org/nutch/NutchTutorial), and I have Solr running in my browser without issue.

I am running the following command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 2

The crawl is working fine, but when it attemps to put the data into Solr, it fails with the following output:

Indexer: starting at 2014-02-06 16:29:28
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication


Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I went to the Nutch logs directory and tailed the hadoop.log file, it shows this:

2014-02-06 16:29:28,920 INFO  solr.SolrIndexWriter - Indexing 1 documents
2014-02-06 16:29:28,921 INFO  httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server localhost failed to respond
2014-02-06 16:29:28,921 INFO  httpclient.HttpMethodDirector - Retrying request
2014-02-06 16:29:28,924 WARN  mapred.LocalJobRunner - job_local331896790_0009
java.io.IOException
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:159)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
    ... 6 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)

Yet, I am still able to access Solr in my browser just fine. This is my first try at Solr/Nutch - any help from those with more knowledge would be much appreciated. Thanks.

Wrinkly answered 7/2, 2014 at 0:40 Comment(0)
I
2

This happens when not all required fields from nutch are in the schema.xml of solr. Did you add the fields from Nutch's schema.xml?

If you add in the section "fields" the following, things should work:

<field name="id" type="string" stored="true" indexed="true"/>
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>

<!-- fields for index-basic plugin -->
<field name="host" type="string" stored="false" indexed="true"/>
<field name="url" type="url" stored="true" indexed="true"
    required="true"/>
<field name="content" type="text_general" stored="false" indexed="true"/>
<field name="title" type="text_general" stored="true" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>

<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="true"
    multiValued="true"/>
<field name="contentLength" type="long" stored="true"
    indexed="false"/>
<field name="lastModified" type="date" stored="true"
    indexed="false"/>
<field name="date" type="date" stored="true" indexed="true"/>

<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="true"/>

<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true"
    indexed="true" multiValued="true"/>

<!-- fields for feed plugin (tag is also used by microformats-reltag)-->
<field name="author" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="feed" type="string" stored="true" indexed="true"/>
<field name="publishedDate" type="date" stored="true"
    indexed="true"/>
<field name="updatedDate" type="date" stored="true"
    indexed="true"/>

<!-- fields for creativecommons plugin -->
<field name="cc" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for tld plugin -->    
<field name="tld" type="string" stored="false" indexed="false"/>
Invert answered 14/2, 2014 at 11:40 Comment(0)
P
0

I had a similar problem with Nutch 1.8 and Solr 4.8.0. In fact Diaa's answer helped me solve the problem. After having removed some intersections of schema.xml with Diaa's field list and after having changed two entries marked as "added by wb" and "changed by wb" I ended up with the following field list which worked for me. As opposed to earlier versions of nutch and solr there is no tag for "fields" any more. Entries tagged as "field" are simply within "schema". This is the complete field list:

   <field name="_root_" type="string" indexed="true" stored="false"/>

   <!-- Only remove the "id" field if you have a very good reason to. While not strictly
     required, it is highly recommended. A <uniqueKey> is present in almost all Solr 
     installations. See the <uniqueKey> declaration below where <uniqueKey> is set to "id".
   -->   
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />

   <field name="store" type="location" indexed="true" stored="true"/>

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them. Some metadata is parsed from the documents,
     but there are some which come from the client context:
       "content_type": From the HTTP headers of incoming stream
       "resourcename": From SolrCell request param resource.name
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="resourcename" type="text_general" indexed="true" stored="true"/>

   <!-- added by wb: required="true" -->
   <field name="url" type="text_general" indexed="true" stored="true" required="true"/> 

   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

   <!-- Main body of document extracted by SolrCell.
        NOTE: This field is not indexed by default, since it is also copied to "text"
        using copyField below. This is to save space. Use this field for returning and
        highlighting document content. Use the "text" field to search the content. -->

   <!-- changedby wb: indexed="true" -->
   <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/> 


   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>

   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>

   <field name="payloads" type="payloads" indexed="true" stored="true"/>

   <!-- Fields needed for Nutch 1.8 integration: -->

    <field name="segment" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>

    <!-- fields for index-basic plugin -->
    <field name="host" type="string" stored="false" indexed="true"/>
    <field name="cache" type="string" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>

    <!-- fields for index-anchor plugin -->
    <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="long" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="date" stored="true" indexed="true"/>

    <!-- fields for languageidentifier plugin -->
    <field name="lang" type="string" stored="true" indexed="true"/>

    <!-- fields for subcollection plugin -->
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="feed" type="string" stored="true" indexed="true"/>
    <field name="publishedDate" type="date" stored="true" indexed="true"/>
    <field name="updatedDate" type="date" stored="true" indexed="true"/>

    <!-- fields for creativecommons plugin -->
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for tld plugin -->    
    <field name="tld" type="string" stored="false" indexed="false"/>

   <!-- End of fields needed for Nutch 1.8 integration: -->
Picasso answered 1/5, 2014 at 13:44 Comment(1)
I tried it but its throwing same error, could you help around. I am using solr 4.8 and nutch 1.12Mammillate
S
0

hello i know this question is old , but for the people using nutch and solr in 2017 with version (nutch 1.13 , solr 5.5.0) , i had the same problem i just solve with following solution

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/nutch urls/ TestCrawl2/ 1

above is the command i a using for crawl , but i had same error when i using this

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls TestCrawl2 2

i just remove the '/' after urls/ TestCrawl2/ , it works for me thanks

Schuck answered 30/11, 2017 at 11:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.