First, a short introduction to the more like this functionality and how it works. The idea is that you have a specific document and want to find others that are similar to it.
To achieve this, we need to extract some content from the current document and use it to build a query that finds similar ones. We can extract the content from the Lucene stored fields (or the elasticsearch _source field, which is effectively a stored field in Lucene) and reanalyze it, or we can use the information stored in the term vectors (if enabled at index time) to get a list of terms to query with, without having to reanalyze the text. I'm not sure whether elasticsearch tries the latter approach when term vectors are available, though.
The more like this query allows you to provide a text, regardless of where you got it from. That text is used to query the fields you select and get back similar documents. The text is not used in its entirety: it is reanalyzed, and at most max_query_terms (default 25) terms are kept, chosen from the terms that occur at least min_term_freq times (minimum term frequency, default 2) and whose document frequency lies between min_doc_freq and max_doc_freq. There are more parameters that can influence the generated query.
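To make the term selection a bit more concrete, here is a rough Python sketch of the filtering described above. It is not the actual Lucene/elasticsearch implementation: the tokenizer, the tf-idf style ranking formula, the `doc_freqs` lookup table and the placeholder values for `min_doc_freq` and `max_doc_freq` are all assumptions made for illustration. Only the roles of `max_query_terms`, `min_term_freq`, `min_doc_freq` and `max_doc_freq` come from the description above.

```python
import math
import re
from collections import Counter


def select_query_terms(text, doc_freqs, num_docs,
                       max_query_terms=25, min_term_freq=2,
                       min_doc_freq=1, max_doc_freq=None):
    """Pick the terms a more-like-this style query would keep (simplified).

    doc_freqs is a hypothetical {term: document frequency} table standing in
    for the index statistics; min_doc_freq=1 and max_doc_freq=None are
    placeholder values for this sketch, not elasticsearch's defaults.
    """
    # Naive "analysis": lowercase and split on non-alphanumerics (assumption).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    term_freqs = Counter(tokens)

    candidates = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if tf < min_term_freq:
            continue  # term is too rare in the input text
        if df < min_doc_freq:
            continue  # term is too rare in the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # term is too common in the index
        # Illustrative tf-idf style score; the exact formula is an assumption.
        idf = math.log(num_docs / (df + 1)) + 1.0
        candidates.append((tf * idf, term))

    # Highest score first, ties broken alphabetically, then cap the list.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in candidates[:max_query_terms]]
```

With a text where "engine" occurs three times and "search" twice, both survive the filters, while a term that occurs only once, or one whose document frequency is above `max_doc_freq`, is dropped before the cut to `max_query_terms`.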
The more like this api goes one step further: it lets you provide the id of a document and, again, a list of fields. The content of those fields is extracted from that specific document and used to make a more like this query on the same fields. That means the generated more like this query has its text property set to the previously extracted text and runs against the same fields. As you can see, the more like this api executes a more like this query under the hood.
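Conceptually, the id-based flow looks something like the sketch below. This is only an illustration of the idea, not how elasticsearch implements it: `store` is a hypothetical in-memory dict standing in for the index, and joining the field values with a space into a single `like_text` is an assumption.

```python
def more_like_this_by_id(store, doc_id, fields):
    """Build the more_like_this query an id-based lookup would boil down to.

    store is a hypothetical {id: source document} dict used in place of a
    real index; how the field values are combined is an assumption.
    """
    source = store[doc_id]
    # Extract the text of the requested fields from that specific document.
    parts = [str(source[field]) for field in fields if source.get(field)]
    # Query the same fields with the extracted text.
    return {
        "more_like_this": {
            "fields": list(fields),
            "like_text": " ".join(parts),
        }
    }
```

The point is only that the id is a shortcut: the text is still looked up first and then fed into an ordinary more like this query.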
The more like this query gives you more flexibility, since you can combine it with other queries and get the text from whatever source you like.
The more like this api, on the other hand, exposes the common functionality and does more of the work for you, but with some restrictions.
In your case I would combine a couple of different more like this queries together, so that you can make use of the powerful elasticsearch query DSL, boost queries differently and so on. The downside is that you have to provide the text yourself, since you can't provide the id of the document to extract it from.
There are different ways to achieve what you want. I would use a bool query to combine the two more like this queries in a should clause and give them a different weight. I would also use the more like this field query instead, since you want to query a single field at a time.
{
  "bool" : {
    "must" : {"match_all" : { }},
    "should" : [
      {
        "more_like_this_field" : {
          "tags" : {
            "like_text" : "here go the tags extracted from the current document!",
            "boost" : 2.0
          }
        }
      },
      {
        "more_like_this_field" : {
          "content" : {
            "like_text" : "here goes the content extracted from the current document!"
          }
        }
      }
    ],
    "minimum_number_should_match" : 1
  }
}
This way at least one of the should clauses must match, and a match on tags is more important than a match on content.
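Since you have to supply the text yourself, you would typically fetch the current document first and build the query from its fields. A minimal Python sketch of that step follows. The helper name `build_similar_docs_query` and the `field_boosts` argument are made up for illustration, and joining list values (such as tags) with spaces is an assumption. The resulting structure mirrors the bool query shown above.

```python
def build_similar_docs_query(doc, field_boosts):
    """Build a bool query with one more_like_this_field clause per field.

    doc is the already-fetched source of the current document and
    field_boosts is a hypothetical {field name: boost} mapping.
    """
    should = []
    for field, boost in field_boosts.items():
        text = doc.get(field)
        if not text:
            continue  # nothing to compare on for this field
        if isinstance(text, (list, tuple)):
            text = " ".join(text)  # e.g. a list of tags (assumption)
        clause = {"like_text": text}
        if boost != 1.0:
            clause["boost"] = boost
        should.append({"more_like_this_field": {field: clause}})

    if not should:
        # With no should clauses, requiring one match could never succeed.
        raise ValueError("document has no text in the requested fields")

    return {
        "bool": {
            "must": {"match_all": {}},
            "should": should,
            "minimum_number_should_match": 1,
        }
    }
```

Calling it with `{"tags": 2.0, "content": 1.0}` gives the same shape as the hand-written query, with the tags clause boosted.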
"id" JSON name is to get the full text and place it in "like_text". There is no way to avoid the round-trip of the full text. There is also no way to reduce it. E.g. there is no way to access the term vector of a document and get only the 25 "top terms", so that I can place them directly in the "like_text" and get the same results I'd get with the full text. Please confirm. I was thinking about writing an elasticsearch plugin that would give me the top n terms for a document. Do you think that would work? – Maddux