With solr-4.9 (the latest version as of now), extracting data from rich documents like pdfs, spreadsheets(xls, xlxs family), presentations(ppt, ppts), documentation(doc, txt etc) has become fairly simple.
The sample code examples provided in the downloaded archive from
here contains a basic solr template project to get you started quickly.
The necessary configuration changes are as follows:
Change the solrConfig.xml
to include following lines :
<lib dir="<path_to_extraction_libs>" regex=".*\.jar" />
<lib dir="<path_to_solr_cell_jar>" regex="solr-cell-\d.*\.jar" />
create a request handler as follows:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults" />
</requestHandler>
2.Add the necessary jars from the solrExample to your project.
3.Define the schema as per your needs and fire a query like :
curl "http://localhost:8983/solr/collection1/update/extract?literal.id=1&literal.filename=testDocToExtractFrom.txt&literal.created_at=2014-07-22+09:50:12.234&commit=true" -F "[email protected]"
go to the GUI portal and query to see the indexed contents.
Let me know if you face any problems.