How to sort on an analyzed/tokenized field in Elasticsearch?
We're storing a title field in our index and want to use the field for two purposes:

  1. We're analyzing with an ngram filter so we can provide autocomplete and instant results
  2. We want to be able to list results using an ASC sort on the title field rather than score.

The index/filter/analyzer is defined like so:

array(
    'number_of_shards' => $this->shards,
    'number_of_replicas' => $this->replicas,
    'analysis' => array(
        'filter' => array(
            'nGram_filter' => array(
                'type' => 'nGram',
                'min_gram' => 2,
                'max_gram' => 20,
                'token_chars' => array('letter','digit','punctuation','symbol')
            )
        ),

        'analyzer' => array(
            'index_analyzer' => array(
                'type' => 'custom',
                'tokenizer' =>'whitespace',
                'char_filter' => 'html_strip',
                'filter' => array('lowercase','asciifolding','nGram_filter')
            ),
            'search_analyzer' => array(
                'type' => 'custom',
                'tokenizer' =>'whitespace',
                'char_filter' => 'html_strip',
                'filter' => array('lowercase','asciifolding')
            )
        )
    )
),

The problem we're experiencing is unpredictable results when we sort on the title field. After a little searching, we found this at the end of the sort page in the Elasticsearch documentation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-sort.html#_memory_considerations):

For string based types, the field sorted on should not be analyzed / tokenized.

How can we both analyze the field and sort on it later? Do we need to store the field twice, with one copy using not_analyzed, in order to sort? Since the _source field also stores the title value in its original state, can that not be used to sort on?

Shauna answered 24/4, 2014 at 15:29 Comment(0)

You can use the built-in Multi Field type in Elasticsearch.

The multi_field type allows to map several core_types of the same value. This can come very handy, for example, when wanting to map a string type, once when it’s analyzed and once when it’s not_analyzed.

In the Elasticsearch Reference, see the String Sorting and Multi Fields guide for how to set up what you need.

Note that the Multi Field mapping configuration changed between Elasticsearch 0.90.X and 1.X, so use the guide that matches your version.
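
As a rough sketch of what this could look like with the 1.X syntax, the mapping below indexes title twice: once through the ngram analyzers from the question, and once as a not_analyzed sub-field used only for sorting. The sub-field name raw, the variable names, and the sample query text are assumptions for illustration, not something taken from the question or the guides.

```php
<?php
// Mapping sketch, assuming Elasticsearch 1.X multi-field syntax ("fields").
// 'raw' is a hypothetical sub-field name; any name would do.
$mapping = array(
    'properties' => array(
        'title' => array(
            'type' => 'string',
            // Analyzer names as defined in the question's index settings.
            'index_analyzer' => 'index_analyzer',
            'search_analyzer' => 'search_analyzer',
            'fields' => array(
                // Untokenized copy of the same value, used only for sorting.
                'raw' => array(
                    'type' => 'string',
                    'index' => 'not_analyzed'
                )
            )
        )
    )
);

// Search body sketch: match against the analyzed field, sort on the raw copy.
$searchBody = array(
    'query' => array(
        'match' => array('title' => 'elast') // hypothetical partial input
    ),
    'sort' => array(
        array('title.raw' => array('order' => 'asc'))
    )
);
```

Keep in mind that a not_analyzed field sorts on the exact stored value, so uppercase titles sort before lowercase ones. If that matters, the sub-field would need its own analyzer that lowercases without tokenizing.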

Assign answered 24/4, 2014 at 15:39 Comment(4)
Exactly what I was looking for, thanks! I especially love the part about "The naive approach to indexing the same string in two ways would be to include two separate fields in the document" on a page related to the one you linked to ;) - Shauna
If you have a slug of the title stored, it is probably a "not_analyzed" field, so you can sort by the slug. - Vomer
Do you know the answer to the second part of the question: "Since the field _source is also storing the title value in its original state, can that not be used to sort on?" I am curious why it is not possible to order by the source value in this case. - Thunell
From the Elasticsearch documentation: "The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search." Fields used for sorting need to be searchable to be sorted properly, so this rules out the _source field. - Assign
