Elasticsearch terms aggregation by strings in an array
Asked Answered
R

1

23

How can I write an Elasticsearch terms aggregation that splits the buckets by the entire term rather than individual tokens? For example, I would like to aggregate by state, but the following returns new, york, jersey and california as individual buckets, not New York and New Jersey and California as the buckets as expected:

curl -XPOST "http://localhost:9200/my_index/_search" -d'
{
    "aggs" : {
        "states" : {
            "terms" : { 
                "field" : "states",
                "size": 10
            }
        }
    }
}'

My use case is like the one described here https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html with just one difference: the city field is an array in my case.

Example object:

{
    "states": ["New York", "New Jersey", "California"]
}

It seems that the proposed solution (mapping the field as not_analyzed) does not work for arrays.

My mapping:

{
    "properties": {
        "states": {
            "type":"object",
            "fields": {
                "raw": {
                    "type":"object",
                    "index":"not_analyzed"
                }
            }
        }
    }
}

I have tried to replace "object" by "string" but this is not working either.

Realpolitik answered 16/11, 2015 at 17:40 Comment(0)
D
19

I think all you're missing is "states.raw" in your aggregation (note that, since no analyzer is specified, the "states" field is analyzed with the standard analyzer; the sub-field "raw" is "not_analyzed"). Though your mapping might bear looking at as well. When I tried your mapping against ES 2.0 I got some errors, but this worked:

PUT /test_index
{
   "mappings": {
      "doc": {
         "properties": {
            "states": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}

Then I added a couple of docs:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"states":["New York","New Jersey","California"]}
{"index":{"_id":2}}
{"states":["New York","North Carolina","North Dakota"]}

And this query seems to do what you want:

POST /test_index/_search
{
    "size": 0, 
    "aggs" : {
        "states" : {
            "terms" : { 
                "field" : "states.raw",
                "size": 10
            }
        }
    }
}

returning:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "states": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "New York",
               "doc_count": 2
            },
            {
               "key": "California",
               "doc_count": 1
            },
            {
               "key": "New Jersey",
               "doc_count": 1
            },
            {
               "key": "North Carolina",
               "doc_count": 1
            },
            {
               "key": "North Dakota",
               "doc_count": 1
            }
         ]
      }
   }
}

Here's the code I used to test it:

http://sense.qbox.io/gist/31851c3cfee8c1896eb4b53bc1ddd39ae87b173e

Dendroid answered 16/11, 2015 at 18:6 Comment(1)
Thank you so much for your answer, you are right, my question is indeed missing the .raw. That is because I had tried so many different combinations of mappings and searches and ended up posting that one. Your answer led me to detect that my real problem is, that I am using the elasticsearch-transport-couchbase plugin to import my documents into Elasticsearch and the plugin changes my document structure, surrounding it with a "doc" attribute. Thanks to your answer, I added a document manually, and it worked, and that's how I detected the surrounding "doc" attribute in the other documents.Realpolitik

© 2022 - 2024 — McMap. All rights reserved.