Counting number of documents using Elasticsearch
D

7

108

If one wants to count the number of documents in an index (of Elasticsearch) then there are (at least?) two possibilities:

  • Direct count

    POST my_index/_count

    should return the number of documents in my_index.

  • Using search

    Here one can set search_type to count, or use any other search type. In either case the total can be read from the ['hits']['total'] field of the response.
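To make the two approaches concrete, here is a minimal sketch of reading the total out of each kind of response body. The helper names are made up for this example and the sample bodies are trimmed down by hand. The object form of hits.total (with "value" and "relation") is what Elasticsearch 7.0+ returns; older versions return a plain integer.

```python
import json

def total_from_count(body: str) -> int:
    """Read the document count from a _count response body."""
    return json.loads(body)["count"]

def total_from_search(body: str) -> int:
    """Read hits.total from a _search response body.

    Pre-7.0 responses carry a plain integer; 7.0+ responses carry
    an object with "value" and "relation" keys.
    """
    total = json.loads(body)["hits"]["total"]
    if isinstance(total, dict):
        return total["value"]
    return total

# Hand-trimmed sample bodies (illustrative only).
count_body = '{"count": 27053653, "_shards": {"total": 3, "successful": 3, "skipped": 0, "failed": 0}}'
old_search_body = '{"hits": {"total": 27053653, "hits": []}}'
new_search_body = '{"hits": {"total": {"value": 27053653, "relation": "eq"}, "hits": []}}'

print(total_from_count(count_body))
print(total_from_search(old_search_body))
print(total_from_search(new_search_body))
```

All three calls should print the same number when the index is not changing between requests.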

My questions are:

  • what is the difference between the different approaches? Which one should I prefer?

  • I raise this question because I'm experiencing different results depending on the chosen method. I'm now in the process of debugging the issue, and this question popped up.

Duvetyn answered 9/9, 2014 at 8:26 Comment(2)
Getting the count is a GET request: {"count":27053653,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0}}Balance
{"count":619397,"_shards":{"total":46,"successful":46,"skipped":0,"failed":0}} got the resp thanks...Loony
N
71

Probably _count is a bit faster since it doesn't have to execute a full query with ranking and result fetching and can simply return the size.

It would be interesting to know a bit more about how you manage to get different results, though. For that I need more information, such as the exact queries you are sending and whether any indexing is happening on the index while you query it.

But suppose that you do the following

  1. index some documents
  2. refresh the index

_search and _count (with a match all query) should return the same total. If not, that'd be very weird.

Nike answered 9/9, 2014 at 11:37 Comment(7)
Unfortunately I won't be able to share the data. Furthermore, since the problem was not 100% reproducible, it would be hard to come up with a minimal example. That's why I asked it as a general question.Duvetyn
won't need the data of course, just anonymize it. But the actual request + response would be useful. Without that it is going to be pretty hard to figure out what you are doing wrong.Nike
apparently _count api is being deprecated in es 2.0 for reasons of being redundant given that you can search with size=0Nike
@JillesvanGurp are you sure _count is deprecated in version 2? ES documentation is pretty good at including deprecation notices, and there is none for _count.Arteritis
_count definitely been removed in v5, I believe. Here is the ticket where the removal is discussed: github.com/elastic/elasticsearch/issues/13928. Searchtype=_count is gone in 5.1 as well.Nike
The _count is there, even in version 7. Not sure about the clients, but in ES itself Count API endpoint has not been removed. See: elastic.co/guide/en/elasticsearch/reference/7.1/…Sabir
It seems you are right. They've been debating this for ages: github.com/elastic/elasticsearch/issues/13928Nike
Q
56

If _search must be used instead of _count, and you're on Elasticsearch 7.0+, setting size: 0 and track_total_hits: true will provide the same info as _count for the same query:

GET my-index/_search
{
  "query": { "term": { "field": { "value": "xyz" } } },
  "size": 0,
  "track_total_hits": true
}


{
  "took" : 612,
  "timed_out" : false,
  "_shards" : {
    "total" : 629,
    "successful" : 629,
    "skipped" : 524,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 29349466,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

See Elasticsearch 7.0 Breaking changes
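If you build this request from code, something along these lines might do. It is only a sketch: the function names are invented for illustration, nothing here talks to a cluster, and the sample response is a cut-down copy of the one above.

```python
import json

def count_via_search_body(query: dict) -> str:
    """Build a _search body that fetches no hits but tracks the exact total."""
    return json.dumps({"query": query, "size": 0, "track_total_hits": True})

def exact_total(response: dict) -> int:
    """Return hits.total.value, refusing totals that are only a lower bound."""
    total = response["hits"]["total"]
    if total["relation"] != "eq":
        raise ValueError("hits.total is a lower bound, not an exact count")
    return total["value"]

body = count_via_search_body({"term": {"field": {"value": "xyz"}}})
print(body)

# Cut-down version of the response shown above.
sample_response = {
    "hits": {
        "total": {"value": 29349466, "relation": "eq"},
        "max_score": None,
        "hits": [],
    }
}
print(exact_total(sample_response))  # prints 29349466
```

Checking "relation" before trusting "value" guards against the case where track_total_hits was left out and the total got capped.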

Questioning answered 25/3, 2020 at 20:59 Comment(2)
If there are 10,000+ documents that match and if I only want to retrieve the first 10,000 as set by index.max_result_window but want to get the actual count, would it be faster to set track_total_hits: true or if the hits count > 10,000 then issue a _count query to get the actual count?Vday
Hmm, I'm testing this on Elasticsearch 8.5: _count returns all items in the index, while _search with a query that restricts the results (say, a term query), size = 0 and track_total_hits: true returns the count of what that query would normally match. In my case _count = 160k and _search with those params = 130k, so I don't see how they are the same thing. I guess they would be if the query were a match_all query.Jennet
F
36

curl 'http://localhost:9200/_cat/indices?v' gives you the document count and other information in a tabular format:

health status index                              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-2019.10.09-000001         IS7HBUgRRzO7Rn1puBFUIQ   1   1          0            0       283b           283b
green  open   .kibana_task_manager_1             e4zZcF9wSQGFHB_lzTszrg   1   0          2            0     12.5kb         12.5kb
yellow open   metricbeat-7.4.0-2019.10.09-000001 h_CWzZHcRsakxgyC36-HTg   1   1       6118            0      2.2mb          2.2mb
green  open   .apm-agent-configuration           J6wkUr2CQAC5kF8-eX30jw   1   0          0            0       283b           283b
green  open   .kibana_2                          W2ZETPygS8a83-Xcd6t44Q   1   0       1836           23      1.1mb          1.1mb
green  open   .kibana_1                          IrBlKqO0Swa6_HnVRYEwkQ   1   0          8            0    208.8kb        208.8kb
yellow open   filebeat-7.4.0-2019.10.09-000001   xSd2JdwVR1C9Ahz2SQV9NA   1   1          0            0       283b           283b
green  open   .tasks                             0ZzzrOq0RguMhyIbYH_JKw   1   0          1            0      6.3kb          6.3kb
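If you need those numbers in a script, a rough way to pull docs.count per index out of the tabular output could look like the sketch below. It assumes the header row is present (the ?v flag) and that no column is empty, which holds for the rows above but may not hold for every index state. The function name is made up for this example.

```python
def docs_count_by_index(cat_output: str) -> dict:
    """Map index name -> docs.count from `_cat/indices?v` text output."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    header = lines[0].split()
    index_col = header.index("index")
    count_col = header.index("docs.count")
    counts = {}
    for line in lines[1:]:
        # Plain whitespace split: assumes every column has a value.
        fields = line.split()
        counts[fields[index_col]] = int(fields[count_col])
    return counts

# Three rows copied from the output above.
sample = """\
health status index                              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   metricbeat-7.4.0-2019.10.09-000001 h_CWzZHcRsakxgyC36-HTg   1   1       6118            0      2.2mb          2.2mb
green  open   .kibana_2                          W2ZETPygS8a83-Xcd6t44Q   1   0       1836           23      1.1mb          1.1mb
green  open   .tasks                             0ZzzrOq0RguMhyIbYH_JKw   1   0          1            0      6.3kb          6.3kb
"""

for name, count in docs_count_by_index(sample).items():
    print(name, count)
```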
Foley answered 9/10, 2019 at 20:56 Comment(1)
Note also that format=json or format=yaml can be provided as a query param to return CAT endpoint results in a machine-readable format. Additionally, the returned columns may be filtered with h=index,docs.count, so a succinct way to retrieve doc counts for many indices might be (URL quoted so the shell does not interpret the &): curl 'http://localhost:9200/_cat/indices?h=index,docs.count&format=json'Questioning
P
25

Old question, but chipping in because on Elasticsearch versions > 7.0:

  1. _search: returns the documents along with a hit count for the search query. By default the reported count is capped at 10,000 (the default track_total_hits threshold, which happens to match the default result window size), e.g.:

    {"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":10000,"relation":"gte"},"max_score": 0.34027478,"hits":[...]}}

  2. _count: returns the total number of hits for the search query irrespective of the result window size. No documents are returned, e.g.:

    {"count":5703899,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}

So, unless track_total_hits is set, _search might report the total hits as 10,000 with relation "gte" (meaning "at least this many"), while _count would return the actual count for the same query.
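One way to act on this in client code is to check "relation" and only fall back to a _count result when the search total is a lower bound. A minimal sketch follows; the function names are hypothetical, and both sample responses are trimmed versions of the ones above.

```python
from typing import Optional

def is_lower_bound(search_response: dict) -> bool:
    """True when hits.total only says 'at least this many' (relation "gte")."""
    return search_response["hits"]["total"]["relation"] == "gte"

def best_total(search_response: dict, count_response: Optional[dict] = None) -> int:
    """Prefer the exact search total; use the _count result when it is capped."""
    if is_lower_bound(search_response):
        if count_response is None:
            raise ValueError("hits.total is capped; an exact number needs _count")
        return count_response["count"]
    return search_response["hits"]["total"]["value"]

# Trimmed versions of the two responses shown above.
capped_search = {"hits": {"total": {"value": 10000, "relation": "gte"}, "hits": []}}
count_result = {"count": 5703899}

print(is_lower_bound(capped_search))            # prints True
print(best_total(capped_search, count_result))  # prints 5703899
```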

Pad answered 20/1, 2020 at 10:52 Comment(4)
Thank you. This should be the accepted answer as _search will never return a count above 10,000.Giannini
disagree, see https://mcmap.net/q/202754/-counting-number-of-documents-using-elasticsearch aboveLucrecialucretia
Thank you! Is there a way to implement the Count api in NEST (.net client)? I know if you do GET /INDEX*/_count that will return the count but how would you do that for nest?Gabbro
@Gabbro No idea, haven't used the .net client :)Pad
C
2

The two queries give the same result, but _count consumes fewer resources and less bandwidth because it does not need to fetch documents or compute scores, and it benefits from other internal optimizations. A search with size set to 0 should behave very similarly.

If you want to count all the records in an index, you can also run a terms aggregation on the "_type" field.

The results should be the same. Before comparing the results, be sure to execute an index refresh.

Chapen answered 9/9, 2014 at 13:29 Comment(1)
The terms aggregation has an accuracy pitfall: you have to set a large size, and it is always bounded from above by MAX_INT...Duvetyn
E
0

You can get the total document count with:

 GET _cat/indices/<index_name>/?h=docs.count
Enormous answered 14/9, 2023 at 8:58 Comment(1)
Thank you for your interest in contributing to the Stack Overflow community. This question already has quite a few answers—including one that has been extensively validated by the community. Are you certain your approach hasn’t been given previously? If so, it would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren’t sufficient. Can you kindly edit your answer to offer an explanation?Bim
M
-1

If you want to check the count index by index, you can use the following query:

GET _all/_search
{
  "size": 0, 
  "aggs": {
    "NAME": {
      "terms": {
        "field": "_index",
        "size": 100000
      }
    }
  }
}

The result was shown in a screenshot (image not available).

Melainemelamed answered 21/9, 2022 at 11:32 Comment(0)
