What is the fastest way to get all _ids of a certain index from Elasticsearch? Is it possible using a simple query? One of my indices has around 20,000 documents.
Edit: Please also read the answer from Aleck Landgraf
You just want the elasticsearch-internal _id field? Or an id field from within your documents?
For the former, try:
curl -H 'Content-Type: application/json' 'http://localhost:9200/index/type/_search?pretty=true' -d '
{
  "query" : {
    "match_all" : {}
  },
  "stored_fields": []
}
'
If you are using Elastic dev tools, use this instead:
GET <your-index-name>/_search
{
  "query" : {
    "match_all" : {}
  },
  "stored_fields": []
}
Note (2017 update): the post originally used "fields": [], but the parameter has since been renamed and stored_fields is the new name.
The result will contain only the "metadata" of your documents:
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "type",
      "_id" : "36",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "38",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "39",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "34",
      "_score" : 1.0
    } ]
  }
}
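If you parse that response in Python, pulling the ids out is a one-liner. This is only a minimal sketch: `extract_ids` and `sample_response` are illustrative names, and the dict is a trimmed copy of the sample response shown above.

```python
def extract_ids(response):
    """Return the `_id` of every hit in a search response dict."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


# Trimmed copy of the sample response (only the keys used here).
sample_response = {
    "hits": {
        "total": 4,
        "max_score": 1.0,
        "hits": [
            {"_index": "index", "_type": "type", "_id": "36", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "38", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "39", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "34", "_score": 1.0},
        ],
    }
}

print(extract_ids(sample_response))  # ['36', '38', '39', '34']
```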
For the latter, if you want to include a field from your document, simply add it to the fields array:
curl -H 'Content-Type: application/json' 'http://localhost:9200/index/type/_search?pretty=true' -d '
{
  "query" : {
    "match_all" : {}
  },
  "fields": ["document_field_to_be_returned"]
}
'
fields was removed; instead, add the "_source": false param. – Rotz

Better to use scroll and scan to get the result list so elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl python lib this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([]) # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
Console log:
GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...
Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, which you can update); scan disables sorting. The scan helper function returns a python generator which can be safely iterated through.
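To make the note concrete, here is roughly what such a helper does under the hood. Treat it as a hedged sketch rather than the library's actual implementation: `iter_ids` is a made-up name, the calls assume a 7.x-style elasticsearch-py client (`search`, `scroll`, `clear_scroll` with a `body` argument), and sorting by `_doc` is used to avoid scoring and sorting, which is the job the note assigns to scan.

```python
def iter_ids(client, index, scroll="1m", page_size=1000):
    """Yield every document _id in `index` by paging with the scroll API."""
    body = {
        "query": {"match_all": {}},
        "sort": ["_doc"],    # index order, so nothing has to be ranked
        "_source": False,    # return hit metadata only
        "size": page_size,   # hits per batch
    }
    # The first request opens the cursor and returns the first batch.
    page = client.search(index=index, scroll=scroll, body=body)
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit["_id"]
            # Each follow-up request renews the cursor for another `scroll` period.
            page = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side cursor once we are done (or on error).
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)


# Usage, assuming a reachable cluster and an index named "my_index":
#   from elasticsearch import Elasticsearch
#   ids = list(iter_ids(Elasticsearch(), "my_index"))
```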
fields has been removed in version 5.0.0 (see: elasticsearch-dsl.readthedocs.io/en/latest/…). You should now use s = s.source([]). – Hydrophobic

For elasticsearch 5.x, you can use the "_source" field.
GET /_search
{
  "_source": false,
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
"fields" has been deprecated.
(Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")
Elaborating on answers by Robert Lujo and Aleck Landgraf, if you want the IDs in a list from the returned generator, here is what I use:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
es = Elasticsearch(hosts=[YOUR_ES_HOST])
hits = helpers.scan(
    es,
    query={"query":{"match_all": {}}},
    scroll='1m',
    index=INDEX_NAME
)
ids = [hit['_id'] for hit in hits]
Another option
curl 'http://localhost:9200/index/type/_search?pretty=true&fields='
will return _index, _type, _id and _score.
stored_fields instead of fields for newer versions – Radford

I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.
The helpers module can be used with sliced scroll and thus allows multi-threaded execution. In my case, I have a high-cardinality field (acquired_at) to slice on as well. You'll see I set max_workers to 14, but you may want to vary this depending on your machine.
Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.
# note below I have es, index, and cluster_name variables already set
import gzip
from concurrent import futures
from elasticsearch import helpers

max_workers = 14
scroll_slice_ids = list(range(0, max_workers))

def get_doc_ids(scroll_slice_id):
    count = 0
    with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
        # "max" is the total number of slices; every slice id in [0, max) must be read,
        # otherwise the documents in the unread slices are silently skipped
        query = {"sort": ["_doc"], "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids)}, "_source": False}
        scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
        for doc in scan:
            count += 1
            results_file.write((doc['_id'] + '\n'))
        results_file.flush()
    return count

if __name__ == '__main__':
    print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # one slice per worker; the with block waits for all slices to finish
        doc_counts = executor.map(get_doc_ids, scroll_slice_ids)
If you want to follow along with how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
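If pigz isn't installed, the same count can be done from Python with the standard gzip module. A small sketch, assuming the /tmp/doc_ids_*.txt.gz naming used by the dump script; `count_ids` is just an illustrative helper name.

```python
import glob
import gzip


def count_ids(pattern="/tmp/doc_ids_*.txt.gz"):
    """Return {path: number of lines} for every gzip file matching `pattern`."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # "rt" decompresses on the fly and yields text lines (one id per line)
        with gzip.open(path, "rt") as fh:
            counts[path] = sum(1 for _ in fh)
    return counts


if __name__ == "__main__":
    per_file = count_ids()
    for path, n in per_file.items():
        print(path, n)
    print("total ids:", sum(per_file.values()))
```

Note that counting a file while it is still being written may fail or undercount, since the gzip stream is not finalized until the writer closes it.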
For Python users: the Python Elasticsearch client provides a convenient abstraction for the scroll API:
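from elasticsearch import Elasticsearch, helpers
client = Elasticsearch()
index = "your-index-name"  # placeholder: replace with your index
query = {
    "query": {
        "match_all": {}
    }
}
scan = helpers.scan(client, index=index, query=query, scroll='1m', size=100)
for doc in scan:
    # do something with each hit, e.g. print its id
    print(doc['_id'])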
You can also do it in Python, which gives you a proper list:
import elasticsearch
es = elasticsearch.Elasticsearch()
res = es.search(
    index=your_index,
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})
ids = [d['_id'] for d in res['hits']['hits']]
Inspired by @Aleck-Landgraf's answer, what worked for me was using the scan function from the standard elasticsearch Python API directly:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
es = Elasticsearch()
for dobj in scan(es,
                 query={"query": {"match_all": {}}, "fields" : []},
                 index="your-index-name", doc_type="your-doc-type"):
    print(dobj["_id"], end=" ")
I suggest using a neat tool like elasticdump and issuing a query like the following:
~/.bin/elasticdump --input='http://username:password@<your-host>:9200/my-index' --output=output.txt --searchBody='{"_source": ["_id"], "query":{ "match_all": {}}}' --limit 10000
Then you can process the output.txt file with the Linux cut command to keep only the id part of each document.
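As an alternative to cut, the dump can be parsed in Python, which is less sensitive to field order. This sketch assumes output.txt holds one JSON object per line with a top-level "_id" key; check a line of your own dump first. `ids_from_dump` is a hypothetical helper name.

```python
import json


def ids_from_dump(lines):
    """Yield the `_id` of each JSON document in an iterable of text lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        yield json.loads(line)["_id"]


# Usage, assuming the output.txt produced by the elasticdump command:
#   with open("output.txt") as fh:
#       ids = list(ids_from_dump(fh))
```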
This is working!
def select_ids(self, **kwargs):
    """
    :param kwargs:params from modules
    :return: array of incidents
    """
    index = kwargs.get('index')
    if not index:
        return None
    # print("Params", kwargs)
    query = self._build_query(**kwargs)
    # print("Query", query)
    # get results
    results = self._db_client.search(body=query, index=index, stored_fields=[], filter_path="hits.hits._id")
    print(results)
    ids = [_['_id'] for _ in results['hits']['hits']]
    return ids
Url -> http://localhost:9200/<index>/<type>/_query
http method -> GET
Query -> {"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]}