Using Solr 3.6 and the ExtractionRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text content of the document.
I would like to provide some snippet highlighting of the content and the subject metadata within the content field is skewing the highlight results.
UPDATE: Screenshot of Tika output as indexed by Solr. Highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.
The ExtractingRequestHandler in solrconfig.xml:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
</lst>
</requestHandler>
Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>