ElasticSearch - Return Unique Values
Asked Answered
T

8

178

How would I get the values of all the languages from the records and make them unique.

Records

PUT items/1
{ "language" : 10 }

PUT items/2
{ "language" : 11 }

PUT items/3
{ "language" : 10 }

Query

GET items/_search
{ ... }

# => Expected Response
[10, 11]

Any help would be great.

Tupler answered 23/8, 2014 at 18:42 Comment(2)
fields: [languages] will give only the values of the given field, but making them unique is probably easier to do in code. Though may be there is a handy aggregation that can do it for you.Unholy
For those researching this topic, there's also useful discussion here : Find distinct values, not distinct counts in elasticsearchToxoplasmosis
W
245

You can use the terms aggregation.

{
"size": 0,
"aggs" : {
    "langs" : {
        "terms" : { "field" : "language",  "size" : 500 }
    }
}}

The size parameter within the aggregation specifies the maximum number of terms to include in the aggregation result. If you need all results, set this to a value that is larger than the number of unique terms in your data.

A search will return something like:

{
"took" : 16,
"timed_out" : false,
"_shards" : {
  "total" : 2,
  "successful" : 2,
  "failed" : 0
},
"hits" : {
"total" : 1000000,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
  "langs" : {
    "buckets" : [ {
      "key" : "10",
      "doc_count" : 244812
    }, {
      "key" : "11",
      "doc_count" : 136794
 
    }, {
      "key" : "12",
      "doc_count" : 32312
       } ]
    }
  }
}
Weksler answered 23/8, 2014 at 18:49 Comment(11)
"fields" : ["language"] brings back the same result. Can you expand on your answer to see if the aggregation framework can return just the language values? #=> [10, 11, 10]Tupler
@CharlesJHardy, it does not have the same result. The data you are looking for is under the "aggregations" key. I edited my answer with an example result. You can/should also set "size": 0, in order to not include any of the documents, only the aggregated results you wanted.Weksler
Note that if you have many possible values for language you might want to add size=0 and shard_size=0, to make sure you get all values. See elasticsearch.org/guide/en/elasticsearch/reference/current/…Hinkel
I think this answer doesn't address the OP. Original question wants distinct values not counts. Am I missing something?Economist
@BHBH, The answer does provide the distinct values. They are the "key" values, i.e., "10", "11" and "12". (aggregations > langs > buckets > key...)Weksler
It doesn't answer the question. If you're looking for 1 value, it makes sense to get the "key" values in "agg" mode. BUT NOT if you want a giant list of results from "Hits"[] array, but filtered by distinct.Wink
@Dexter, look at the example output. It gives the distinct values without duplicates, which is exactly what the question asked for. The list isn't giant because there were only three distinct languages in the test data set. EDIT: actually, I seem to have had a larger test data set as the original only had two languages.Weksler
You can't use "size": 0 in terms aggregations in elasticsearch 5.x: github.com/elastic/elasticsearch/issues/22136. Either you learn the number of terms from cardinality aggregation or you just put some big number there but not 0.Wildfowl
@EgorIgnatenkov, Thanks, the answer was written long before ES 5. I edited the answer to set the size parameter in the aggregation. Many thanks for your attention to detail and pointing this out!Weksler
@Weksler for latest realte you missing 'keyword' thing. It should looks like { "size": 0, "aggs" : { "langs" : { "terms" : { "field" : "language.keyword", "size" : 500 } } }}Iodic
@Weksler follow up question (taking advantage of the already prepared context) to your super good answer: Is there a way to modify your query so I get back just keys of buckets or I would need to "process" ES results manually for that? in this example I would like to get back just [10, 11, 12]. To ask differently: if I put items in index, do I always get list of items or I can setup query to get list of some properties of item entities (in this case list of language values)?Lamblike
A
24

I am looking for this kind of solution for my self as well. I found reference in terms aggregation.

So, according to that following is the proper solution.

{
"aggs" : {
    "langs" : {
        "terms" : { "field" : "language",  
                    "size" : 500 }
    }
}}

But if you ran into following error:

"error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [fastest_method] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
            }
        ]}

In that case, you have to add "KEYWORD" in the request, like following:

   {
    "aggs" : {
        "langs" : {
            "terms" : { "field" : "language.keyword",  
                        "size" : 500 }
        }
    }}
Amu answered 31/7, 2019 at 19:15 Comment(0)
N
18

If you want to get all unique values without any approximation or setting a magic number (size: 500), then use COMPOSITE AGGREGATION (ES 6.5+).

From official documentation:

"If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the COMPOSITE AGGREGATION which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination."

Implementation example in JavaScript:

const ITEMS_PER_PAGE = 1000;

const body =  {
    "size": 0, // Returning only aggregation results: https://www.elastic.co/guide/en/elasticsearch/reference/current/returning-only-agg-results.html
    "aggs" : {
        "langs": {
            "composite" : {
                "size": ITEMS_PER_PAGE,
                "sources" : [
                    { "language": { "terms" : { "field": "language" } } }
                ]
            }
        }
     }
};

const uniqueLanguages = [];

while (true) {
  const result = await es.search(body);

  const currentUniqueLangs = result.aggregations.langs.buckets.map(bucket => bucket.key);

  uniqueLanguages.push(...currentUniqueLangs);

  const after = result.aggregations.langs.after_key;

  if (after) {
      // continue paginating unique items
      body.aggs.langs.composite.after = after;
  } else {
      break;
  }
}

console.log(uniqueLanguages);
Nicole answered 7/4, 2020 at 7:9 Comment(1)
This should be the accepted answer. It solves the problem of the size parameterStgermain
W
17

if you want to get the first document for each language field unique value, you can do this:

{
 "query": {
    "match_all": {
    }
  },
  "collapse": {
    "field": "language.keyword",
    "inner_hits": {
    "name": "latest",
      "size": 1
    }
  }
}
Wagshul answered 13/11, 2019 at 12:49 Comment(2)
"name": "latest", is "name" from ES syntax or it is custom user field?Flosser
As a note : the collapse field must be single valuedSericeous
R
11

Elasticsearch 1.1+ has the Cardinality Aggregation which will give you a unique count of the terms, but not the terms themselves.

Note that it is actually an approximation and accuracy may diminish with high-cardinality datasets, but it's generally pretty accurate in my testing.

You can also tune the accuracy with the precision_threshold parameter. The trade-off, or course, is memory usage.

This graph from the docs shows how a higher precision_threshold leads to much more accurate results.


Relative error vs threshold

Repletion answered 14/1, 2015 at 20:18 Comment(4)
Does Cardinality Aggregation guarantee that if a term exists, then it will appear in the results (with a count >= 1)? Or could it possibly miss some terms that only appear once in a large dataset?Yecies
@Yecies it depends on the precision threshold that you set. The higher the threshold, the lesser chance that it will miss. Note that there is a limit of 40,000 in precision threshold setting. Which means, a data set higher than that, there will be an estimate and hence the single value may be missedShelter
I believe this answer is wrong. The cardinality aggregation is an excellent tool. However, the task was to retrieve the terms themselves, not estimate how many different terms there are.Weksler
While this is the wrong answer, it is what I was looking for when I googled Count distinct values elasticsearchEndor
S
4

To had to distinct by two fields (derivative_id & vehicle_type) and to sort by cheapest car. Had to nest aggs.

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "distinct_by_derivative_id": {
      "terms": { 
        "field": "derivative_id"
      },
      "aggs": {
        "vehicle_type": {
          "terms": {
            "field": "vehicle_type"
          },
          "aggs": {
            "cheapest_vehicle": {
              "top_hits": {
                "sort": [
                  { "rental": { "order": "asc" } }
                ],
                "_source": { "includes": [ "manufacturer_name",
                  "rental",
                  "vehicle_type" 
                  ]
                },
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "distinct_by_derivative_id" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "04",
          "doc_count" : 3,
          "vehicle_type" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "CAR",
                "doc_count" : 2,
                "cheapest_vehicle" : {
                  "hits" : {
                    "total" : {
                      "value" : 2,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "cars",
                        "_type" : "_doc",
                        "_id" : "8",
                        "_score" : null,
                        "_source" : {
                          "vehicle_type" : "CAR",
                          "manufacturer_name" : "Renault",
                          "rental" : 89.99
                        },
                        "sort" : [
                          89.99
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "key" : "LCV",
                "doc_count" : 1,
                "cheapest_vehicle" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "cars",
                        "_type" : "_doc",
                        "_id" : "7",
                        "_score" : null,
                        "_source" : {
                          "vehicle_type" : "LCV",
                          "manufacturer_name" : "Ford",
                          "rental" : 99.99
                        },
                        "sort" : [
                          99.99
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },
        {
          "key" : "01",
          "doc_count" : 2,
          "vehicle_type" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "CAR",
                "doc_count" : 1,
                "cheapest_vehicle" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "cars",
                        "_type" : "_doc",
                        "_id" : "1",
                        "_score" : null,
                        "_source" : {
                          "vehicle_type" : "CAR",
                          "manufacturer_name" : "Ford",
                          "rental" : 599.99
                        },
                        "sort" : [
                          599.99
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "key" : "LCV",
                "doc_count" : 1,
                "cheapest_vehicle" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "cars",
                        "_type" : "_doc",
                        "_id" : "2",
                        "_score" : null,
                        "_source" : {
                          "vehicle_type" : "LCV",
                          "manufacturer_name" : "Ford",
                          "rental" : 599.99
                        },
                        "sort" : [
                          599.99
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },
        {
          "key" : "02",
          "doc_count" : 2,
          "vehicle_type" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "CAR",
                "doc_count" : 2,
                "cheapest_vehicle" : {
                  "hits" : {
                    "total" : {
                      "value" : 2,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "cars",
                        "_type" : "_doc",
                        "_id" : "4",
                        "_score" : null,
                        "_source" : {
                          "vehicle_type" : "CAR",
                          "manufacturer_name" : "Audi",
                          "rental" : 499.99
                        },
                        "sort" : [
                          499.99
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },
        {
          "key" : "03",
          "doc_count" : 1,
          "vehicle_type" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "CAR",
                "doc_count" : 1,
                "cheapest_vehicle" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "cars",
                        "_type" : "_doc",
                        "_id" : "5",
                        "_score" : null,
                        "_source" : {
                          "vehicle_type" : "CAR",
                          "manufacturer_name" : "Audi",
                          "rental" : 399.99
                        },
                        "sort" : [
                          399.99
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Selmore answered 18/1, 2021 at 11:35 Comment(0)
S
0

Short solution to be used in console, inspired by discussion post

POST _sql?format=txt
{
  "query": """SELECT language FROM "index-name" GROUP by language"""
}
Soldier answered 27/4 at 20:27 Comment(0)
P
-2

aggs will work on number value by default if you want work on string filed you should enable it on fie

Pinto answered 19/9, 2021 at 5:33 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Goiter

© 2022 - 2024 — McMap. All rights reserved.