Customising the search algorithm of Elasticsearch

I originally tried posting a similar post to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses so I though I'd give Stack Overflow a try. This is my first post on SO so apologies if it doesn't quite fit into the mould it is meant to.

I'm currently working with a university helping them to implement a test suite to further refine some research they have been conducting. Their research is based around dynamic schema searching. After spending some time evaluating the various open source search solutions I settled on elasticsearch as the base platform and I am wondering what the best way to proceed would be. I have spent about a week looking into the elasticsearch documentation and the code itself and also reading the documentation of Lucene but I am struggling to see a clear way forward.

The goal of the project is to provide the researches with a piece of software they can use to plugin revisions of the searching algorithm to test and refine. They would like to be able to write the pluggable algorithm in languages other then Java that is supported by the JVM like Groovy, Python or Closure but that isn't a hard requirement. Part of that will be to provide them with a front end to run queries and see output and an admin interface to add documents to an index. I am comfortable with all of that thanks to the very powerful and complete REST API. What I am not so sure about is how to proceed with implementing the pluggable search algorithm.

The researcher's algorithm requires 4 inputs to function:

The query terms(s).
A Word (term) x Document matrix across a index.
A Document x Word (term) matrix across a index.
A Word (term) frequency list across a index. That is how many times each word appears across the entire index.

For their purposes, a document doesn't correspond to an actual real-world document (they actually call them text events). Rather, for now, it corresponds to one sentence (having that configurable might also be useful). I figure the best way to handle this is to break down documents into their sentences (using Apache Tika or something similar), putting each sentence in as its own document in the index. I am confident I can do this in the Admin UI I provide using the mapper-attachement plugin as a starting point. The downside is that breaking up the document before giving it to elasticsearch isn't a very configurable way of doing it. If they want to change the resolution to their algorithm, they would need to re-add all documents to the index again. If the index stored that full documents as is and the search algorithm could chose what resolution to work at per query then that would be perfect. I'm not sure it is possible or not though.

The next problem is how to get the three inputs they require and pass it into their pluggable search algorithm. I'm really struggling where to start with this one. It seems from looking at Luecene that I need to provide my own search/query implementation, but I'm not sure if this is right or not. There also doesn't seem to be any search plugins listed on the elasticsearch site, so I'm not even sure if it is possible. The important things here are that the algorithm needs to operate at the index level with the query terms available to generate its schema before using the schema to score each document in the index. From what I can tell, this means that the scripting interface provided by elasticsearch won't be of any use. The description of the scripting interface in the elasticsearch guide makes it sound like a script operates at the document level and not the index level. Other concerns/considerations are the ability to program this algorithm in a range of languages (just like the scripting interface) and the ability to augment what is returned by the REST API for a search to include the schema the algorithm generated (which I assume means I will need to define my own REST endpoint(s)).

Can anybody give me some advice on where to get started here? It seems like I am going to have to write my own search plugin that can accept scripts as it's core algorithm. The plugin will be responsible for organising the 4 inputs that I outlined earlier before passing control to the script. It will also be responsible for getting the output from the script and returning it via it's own REST API. Does this seem logical? If so, how do I get started with doing this? What parts of the code do I need to look it?

Recommended topics

Hot tags