Elasticsearch "More Like This" API vs. more_like_this query

Elasticsearch has two similar features to get "similar" documents:

There is the "More Like This API". It gives me documents similar to a given one. I can't use it in more complex expressions though.

There is also the "more_like_this" query for use in the Search API. I can use it in bool or boosting expressions, but I can't give it the id of a document. I have to provide the "like_text" parameter.

I have documents with tags and content. Some documents will have good tags and some won't have any. I want a "Similar documents" feature that will work every time but will rank documents with matching tags higher than documents with matching text. My idea was:

{
    "boosting" : {
        "positive" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452",
                "min_term_freq" : 1
            }
        },
        "negative" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452"
            }
        },
        "negative_boost" : 0.2
    }
}

Obviously this doesn't work because there is no "id" in "more_like_this". What are the alternatives?

Photothermic answered 8/3, 2013 at 18:17 Comment(0)

First, a short introduction to the more like this functionality and how it works. The idea is that you have a specific document and want to find others that are similar to it.

To achieve this we need to extract some content from the current document and use it to build a query that retrieves similar ones. The content can come from the Lucene stored fields (or the Elasticsearch _source field, which is effectively a stored field in Lucene), in which case it has to be reanalyzed, or from the term vectors (if enabled while indexing), which give us a list of terms to query with, without having to reanalyze the text. I'm not sure whether Elasticsearch tries this latter approach when term vectors are available, though.

The more like this query allows you to provide a text, regardless of where you got it from. That text is used to query the fields you select and get back similar documents. The text is not used in its entirety: it is reanalyzed, and at most max_query_terms (default 25) terms are kept, chosen among those with a term frequency of at least min_term_freq (default 2) and a document frequency between min_doc_freq and max_doc_freq. Other parameters can influence the generated query too.
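To make the term selection described above more concrete, here is a minimal Python sketch of the filtering step. It is not the actual Lucene/Elasticsearch code: the function name `select_query_terms`, the regex tokenizer standing in for the field's analyzer, the caller-supplied `doc_freq` dictionary, the permissive defaults for the document-frequency bounds, and the tf-idf style ranking are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def select_query_terms(text, doc_freq, num_docs,
                       max_query_terms=25, min_term_freq=2,
                       min_doc_freq=1, max_doc_freq=None):
    """Return the terms of `text` that would survive the filters described above.

    doc_freq maps a term to the number of indexed documents containing it
    (hypothetical input; a real index would supply these statistics).
    """
    # Assumption: a lowercase word tokenizer replaces the real analyzer.
    term_freq = Counter(re.findall(r"\w+", text.lower()))

    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq:
            continue  # too rare inside the input text
        if df < min_doc_freq:
            continue  # too rare across the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common across the index (stopword-like)
        # Assumption: rank the surviving terms by a simple tf-idf score.
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf, term))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]
```

Note that a term occurring only once in the input is dropped under the default min_term_freq of 2, which is why the question sets min_term_freq to 1 for the tag field.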

The more like this API goes one step further, allowing you to provide the id of a document and, again, a list of fields. The content of those fields is extracted from that specific document and used to make a more like this query on the same fields. That means the generated more like this query has its like_text property set to the text previously extracted and runs against the same fields. As you can see, the more like this API executes a more like this query under the hood.

In short, the more like this query gives you more flexibility, since you can combine it with other queries and take the text from whatever source you like. The more like this API, on the other hand, covers the common case and does more of the work for you, but with some restrictions.

In your case I would combine a couple of different more like this queries together, so that you can make use of the powerful elasticsearch query DSL, boost queries differently and so on. The downside is that you have to provide the text yourself, since you can't provide the id of the document to extract it from.

There are different ways to achieve what you want. I would use a bool query to combine the two more like this queries in a should clause and give them a different weight. I would also use the more like this field query instead, since you want to query a single field at a time.

{
    "bool" : {
        "must" : {"match_all" : { }},
        "should" : [
            {
              "more_like_this_field" : {
                "tags" : {
                  "like_text" : "here go the tags extracted from the current document!",
                  "boost" : 2.0
                }
              }
            },
            {
              "more_like_this_field" : {
                "content" : {
                  "like_text" : "here goes the content extracted from the current document!"
                }
              }
            }
        ],
        "minimum_number_should_match" : 1
    }
}

This way at least one of the should clauses must match, and a match on tags is more important than a match on content.
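Since the text has to be supplied by the client, the query above would typically be built from the _source of a document fetched beforehand. The following Python sketch shows one way to do that. The function name `similar_documents_query` and the shape of the document (a list of strings under "tags", a string under "content") are assumptions, and "min_term_freq" : 1 on the tags clause is carried over from the question, because each tag normally occurs only once.

```python
import json


def similar_documents_query(source, tag_boost=2.0):
    """Build the bool query shown above from an already-fetched _source dict."""
    should = []

    # Assumption: tags are stored as a list of strings.
    tags = " ".join(source.get("tags") or [])
    if tags:
        should.append({"more_like_this_field": {"tags": {
            "like_text": tags,
            "boost": tag_boost,
            "min_term_freq": 1,  # from the question: a tag rarely repeats
        }}})

    # Assumption: content is a single string field.
    content = source.get("content") or ""
    if content:
        should.append({"more_like_this_field": {"content": {
            "like_text": content,
        }}})

    return {"bool": {
        "must": {"match_all": {}},
        "should": should,
        "minimum_number_should_match": 1,
    }}


if __name__ == "__main__":
    doc = {"tags": ["search", "lucene"],
           "content": "Finding similar documents with Lucene."}
    print(json.dumps(similar_documents_query(doc), indent=2))
```

Documents without tags simply get no tags clause, so they are still matched through their content, which is the "works every time" behaviour the question asks for.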

Melanson answered 10/3, 2013 at 8:49 Comment(2)
Thanks for the answer. So the only alternative to the non-existent "id" JSON name is to get the full text and place it in "like_text". There is no way to avoid the round-trip of the full text. There is also no way to reduce it. E.g. there is no way to access the term vector of a document and get only the 25 "top terms", so that I can place them directly in the "like_text" and get the same results I'd get with the full text. Please confirm. I was thinking about writing an Elasticsearch plugin that would give me the top n terms for a document. Do you think that would work? - Maddux
As far as I know there's no out-of-the-box way to achieve what you want. You could probably write a plugin that exposes a new type of more like this query that accepts the id of a document as input and gets the text from it, maybe even using term vectors when available. - Melanson

This is possible now with the new like syntax:

{
    "more_like_this" : {
        "fields" : ["title", "description"],
        "like" : [
            {
                "_index" : "imdb",
                "_type" : "movies",
                "_id" : "1"
            },
            {
                "_index" : "imdb",
                "_type" : "movies",
                "_id" : "2"
            }
        ],
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}

See here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
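Applied to the original question, the like syntax could be combined with a bool query so that tag matches outrank content matches, without sending the text from the client. The Python sketch below only builds the query body and has not been run against a cluster. The function name `similar_by_id_query` and the field names "tag" and "content" are assumptions taken from the question, and "_type" is left optional on the assumption that newer Elasticsearch versions no longer use mapping types.

```python
import json


def similar_by_id_query(index, doc_id, doc_type=None, tag_boost=2.0):
    """Build a bool query of two more_like_this clauses keyed by document id."""
    like = {"_index": index, "_id": doc_id}
    if doc_type is not None:
        like["_type"] = doc_type  # only for versions that still have types

    return {"bool": {
        "should": [
            # Tag similarity: boosted, and min_term_freq lowered to 1
            # because a tag usually occurs once per document.
            {"more_like_this": {
                "fields": ["tag"],
                "like": [dict(like)],
                "min_term_freq": 1,
                "boost": tag_boost,
            }},
            # Content similarity: default weight.
            {"more_like_this": {
                "fields": ["content"],
                "like": [dict(like)],
            }},
        ],
        "minimum_should_match": 1,
    }}


if __name__ == "__main__":
    print(json.dumps({"query": similar_by_id_query("docs", "23452")}, indent=2))
```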

Uraninite answered 20/7, 2015 at 19:41 Comment(1)
In recent Elasticsearch versions, the docs keyword has been deprecated in favour of like. - Mages
