textual content without metadata from Tika via SolrCell
Asked Answered
R

5

5

Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.

I would like to provide some snippet highlighting of the content and the subject metadata within the content field is skewing the highlight results.

UPDATE: Screenshot of Tika output as indexed by Solr. Highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.

solr screenshot of tika output

The ExtractingRequestHandler in solrconfig.xml:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    </lst>
</requestHandler>

Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Rask answered 4/6, 2012 at 21:43 Comment(2)
Tika gives you the metadata and the content independently, sadly I don't know how to configure SOLR to ignore one of them...Anthologize
@Anthologize better late than never..so i had the same situation, and found out, the captureAttr was the thing causing issues. See my anwsHomophonic
H
6

As all other answers are completely irrelevant, I'll post mine:

I have experienced exactly the same problem as OP describes, (Solr 4.3.0, custom config, custom schema, etc. I'm not newbie or something and understand Solr internals pretty well)

This was my ERH config:

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">false</str>

      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

It was basically configured to ignore everything except the content (i believe it's reasonable for many people).

After careful investigation i found out, that

<str name="captureAttr">false</str>

was the thing caused OP's issue. By default it is turned on, but i turned it off as i did not need it anyway. And that was my mistake. I have no idea why, but it causes Solr to put extracted attributes into fmap.content field altogether with extracted text.

So the solution is to turn it back on. Final ERH:

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">true</str>

      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

Now, only extracted text is put to fmap.content field.

Unfortunately i have not found any piece of documentation which can explain this. Either bug or just stupid behavior

Homophonic answered 20/2, 2014 at 14:2 Comment(1)
Hi, I am facing the same issue. I want only the content of h1,h2 and p tags. But the crawler returns the whole element including all attribute values. Any solution would be greatly appreciated.Corps
E
2

Tika with Solr produces different fields for the content and the metadata.

If you use the Standard ExtractingRequestHandler -

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>   
</requestHandler>

The field map content is set to text field which should be only the content of your pdf.

The other metadata fields can be easily checked by modifying the schema.xml.

mark stored true for igonred field type

<fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />

Capture all fields -

   <dynamicField name="*" type="ignored" multiValued="true" />

Tika adds lot of fields for the metadata with the content being set separately e.g. response when fed extract handler with a ppt.

<doc>
    <arr name="application_name">
        <str>Microsoft PowerPoint</str>
    </arr>
    <str name="category">POT - US</str>
    <str name="comments">version 1.1</str>
    <arr name="company">
        <str>
        </str>
    </arr>
    <arr name="content_type">
        <str>application/vnd.ms-powerpoint</str>
    </arr>
    <arr name="creation_date">
        <str>2000-03-15T16:57:27Z</str>
    </arr>
    <arr name="custom_delivery_date">
        <str>
        </str>
    </arr>
    <arr name="custom_docid">
        <str>
        </str>
    </arr>
    <arr name="custom_docidinslide">
        <str>true</str>
    </arr>
    <arr name="custom_docidintitle">
        <str>true</str>
    </arr>
    <arr name="custom_docidposition">
        <str>0</str>
    </arr>
    <arr name="custom_event">
        <str>
        </str>
    </arr>
    <arr name="custom_final">
        <str>false</str>
    </arr>
    <arr name="custom_mckpapersize">
        <str>US</str>
    </arr>
    <arr name="custom_notespagelayout">
        <str>Lower</str>
    </arr>
    <arr name="custom_title">
        <str>Lower Universal Template US</str>
    </arr>
    <arr name="custom_universal_objects">
        <str>true</str>
    </arr>
    <arr name="edit_time">
        <str>284587970000</str>
    </arr>
    <str name="id">101</str>
    <arr name="ignored_">
        <str>slideShow</str>
        <str>slide</str>
        <str>slide</str>
        <str>slideNotes</str>
    </arr>
    <str name="keywords">test</str>
    <arr name="last_author">
        <str>Corporate</str>
    </arr>
    <arr name="last_printed">
        <str>2000-03-17T20:28:57Z</str>
    </arr>
    <arr name="last_save_date">
        <str>2009-03-24T16:52:26Z</str>
    </arr>
    <arr name="manager">
        <str>
        </str>
    </arr>
    <arr name="meta">
        <str>stream_source_info</str>
        <str>file:/C:/temp/nuggets/100000.ppt</str>
        <str>Last-Author</str>
        <str>Corporate</str>
        <str>Slide-Count</str>
        <str>2</str>
        <str>custom:DocIDPosition</str>
        <str>0</str>
        <str>Application-Name</str>
        <str>Microsoft PowerPoint</str>
        <str>custom:Delivery Date</str>
        <str>
        </str>
        <str>custom:Event</str>
        <str>
        </str>
        <str>Edit-Time</str>
        <str>284587970000</str>
        <str>Word-Count</str>
        <str>120</str>
        <str>Creation-Date</str>
        <str>2000-03-15T16:57:27Z</str>
        <str>stream_size</str>
        <str>181248</str>
        <str>Manager</str>
        <str>
        </str>
        <str>stream_name</str>
        <str>100000.ppt</str>
        <str>Company</str>
        <str>
        </str>
        <str>Keywords</str>
        <str>test</str>
        <str>Last-Save-Date</str>
        <str>2009-03-24T16:52:26Z</str>
        <str>Revision-Number</str>
        <str>91</str>
        <str>Last-Printed</str>
        <str>2000-03-17T20:28:57Z</str>
        <str>Comments</str>
        <str>version 1.1</str>
        <str>Template</str>
        <str>
        </str>
        <str>custom:PaperSize</str>
        <str>US</str>
        <str>custom:DocID</str>
        <str>
        </str>
        <str>xmpTPg:NPages</str>
        <str>2</str>
        <str>custom:NotesPageLayout</str>
        <str>Lower</str>
        <str>custom:DocIDinSlide</str>
        <str>true</str>
        <str>Category</str>
        <str>POT - US</str>
        <str>custom:Universal Objects</str>
        <str>true</str>
        <str>custom:Final</str>
        <str>false</str>
        <str>custom:DocIDinTitle</str>
        <str>true</str>
        <str>Content-Type</str>
        <str>application/vnd.ms-powerpoint</str>
        <str>custom:Title</str>
        <str>test</str>
    </arr>
    <arr name="p">
        <str>slide-content</str>
        <str>slide-content</str>
    </arr>
    <arr name="revision_number">
        <str>91</str>
    </arr>
    <arr name="slide_count">
        <str>2</str>
    </arr>
    <arr name="stream_name">
        <str>100000.ppt</str>
    </arr>
    <arr name="stream_size">
        <str>181248</str>
    </arr>
    <arr name="stream_source_info">
        <str>file:/C:/temp/test/100000.ppt</str>
    </arr>
    <arr name="template">
        <str>
        </str>
    </arr>
    <!-- Content field -->
    <arr name="text">
        <str>test Test test test test tes t</str>
    </arr>
    <arr name="title">
        <str>test</str>
    </arr>
    <arr name="word_count">
        <str>120</str>
    </arr>
    <arr name="xmptpg_npages">
        <str>2</str>
    </arr>
</doc>
Enoch answered 6/6, 2012 at 7:9 Comment(5)
You'd think that the content field contained "only the content of your pdf", but it doesn't. It contains the PDF content plus all the metadata as a kind of header preceding the content. I'm updating my question with screenshots.Rask
can you update the extract requesthandler as mentioned in my answer and check.Enoch
Mapping content to a field named text with <str name="fmap.content">text</str> doesn't change anything, it merely modifies the field name, and the problem remains. While I appreciate your attentiveness, @jayendra, I am familiar with the content of the Solr wiki and am not a noob setting up Solr for the first time. Do you have any suggestions beyond quoting the documentation?Rask
Nope ... Cause I can't debug your environment and I don't see the behavior you have mentioned.Enoch
Please add a comment when voting down. Helps to improve the answerEnoch
R
0

I no longer have the problem I described above. Since asking the question, I have updated to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with the 4.0a package. I suspect my original schema was copying the metadata fields' content to the text field, so it was most likely my own error.

Rask answered 1/8, 2012 at 16:50 Comment(2)
Well, i have the same problem and I would rather prefer any information then 'it was fixed by itself'...Homophonic
@taras.roshko After you posted your answer, I checked my Solr 4.x solrconfig.xml (with the ERH that works fine) and sure enough, I have <str name="captureAttr">true</str>. I believe my Solr 3.6 ERH had nothing defined for captureAttr.Rask
S
0

In the solrconfig.xml, where the request handler is defined, add this line below

<str name="fmap.title">ignored_</str>

This tells Tika to simply ignore the title attribute (or which ever attributes you want ignored) it finds embedded within the PDF.

Serles answered 9/10, 2012 at 17:43 Comment(1)
this has nothing in common with title, see OP's screenshotHomophonic
A
0

In my case, <str name="xpath">/xhtml:html/xhtml:body//node()</str> allowed extraction of content without the meta.

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">content</str>
      <!-- Specify where content should be extracted exactly -->
      <str name="xpath">/xhtml:html/xhtml:body//node()</str>
    </lst>
</requestHandler>
Ashleaashlee answered 7/7, 2022 at 17:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.