Elasticsearch not returning data when the page size is large

Size of data to fetch: approximately 20,000 documents

Issue: I am searching Elasticsearch-indexed data using the command below in Python, but I am not getting any results back.

from pyelasticsearch import ElasticSearch

es_repo = ElasticSearch(settings.ES_INDEX_URL)
search_results = es_repo.search(
    query, index=advertiser_name, es_from=_from, size=_size)

If I set size to 10,000 or less it works fine, but not with 20,000. Please help me find an optimal solution to this.

PS: Digging deeper into ES, I found this error message:

Result window is too large, from + size must be less than or equal to: [10000] but was [19999]. See the scrolling API for a more efficient way to request large data sets.

Calix asked 16/3, 2018 at 12:18 Comment(8)
Do you prefer a solution for real-time use or for analysis? - Bautzen
@Bautzen It's for real-time use. - Calix
The scroll query that appears in your error message is the optimal solution for analysis, not for real-time use, because it needs more resources to run. Please read the search_after query page of the ES documentation mentioned in my answer. - Bautzen
Thanks, but I am not sure what the difference between real-time use and analysis is. Can you help me understand? - Calix
Please read the explanation provided by the Elasticsearch developers here: discuss.elastic.co/t/scroll-vs-search-api/28294 - Bautzen
However, if you prefer the scroll query, I will provide an example. - Bautzen
Yes please, I would like to try that if possible. - Calix
I have updated my answer to cover the scroll query as well. - Bautzen

For real-time use the best solution is the search_after query. You only need a date field plus another field that uniquely identifies a doc - an _id or _uid field is enough. Try something like this; in my example I extract all the documents that belong to a single user, and the user field has the keyword datatype:

from elasticsearch import Elasticsearch


es = Elasticsearch()
es_index = "your_index_name"
documento = "your_doc_type"

user = "Francesco Totti"

# Count how many documents match, so we know when to stop paginating.
body2 = {
    "query": {
        "term": {"user": user}
    }
}

res = es.count(index=es_index, doc_type=documento, body=body2)
size = res['count']


body = { "size": 10,
            "query": {
                "term" : {
                    "user" : user
                }
            },
            "sort": [
                {"date": "asc"},
                {"_uid": "desc"}
            ]
        }

result = es.search(index=es_index, doc_type=documento, body= body)
bookmark = [result['hits']['hits'][-1]['sort'][0], str(result['hits']['hits'][-1]['sort'][1]) ]

body1 = {"size": 10,
            "query": {
                "term" : {
                    "user" : user
                }
            },
            "search_after": bookmark,
            "sort": [
                {"date": "asc"},
                {"_uid": "desc"}
            ]
        }

# Keep fetching pages and appending the hits until all matches are collected.
while len(result['hits']['hits']) < size:
    res = es.search(index=es_index, doc_type=documento, body=body1)
    if not res['hits']['hits']:
        break
    for el in res['hits']['hits']:
        result['hits']['hits'].append(el)
    # Advance the bookmark to the sort values of the last hit just fetched.
    bookmark = [res['hits']['hits'][-1]['sort'][0], str(res['hits']['hits'][-1]['sort'][1])]
    body1 = {
        "size": 10,
        "query": {
            "term": {"user": user}
        },
        "search_after": bookmark,
        "sort": [
            {"date": "asc"},
            {"_uid": "desc"}
        ]
    }

Then you will find all the docs appended to the result variable.
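For example, a minimal sketch (assuming the result variable built by the loop above) that keeps only the document bodies:

# Keep just the _source bodies of the accumulated hits.
all_docs = [hit['_source'] for hit in result['hits']['hits']]
print(len(all_docs))  # should match the count computed earlier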

If you would like to use the scroll query instead (see the Elasticsearch scroll documentation):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
es_index = "your_index_name"
documento = "your_doc_type"

user = "Francesco Totti"

body = {
    "query": {
        "term": {"user": user}
    }
}

# helpers.scan wraps the scroll API and yields every matching hit,
# handling the scroll_id bookkeeping for you.
res = helpers.scan(
    client=es,
    scroll='2m',
    query=body,
    index=es_index)

for i in res:
    print(i)
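If you only need the matching documents in memory, a minimal sketch (reusing the es, es_index and body variables defined above) is to collect the _source bodies directly; note that helpers.scan does not apply any sort order by default:

# scan() streams hits page by page; the list comprehension materialises them.
docs = [hit['_source'] for hit in helpers.scan(client=es, scroll='2m', query=body, index=es_index)]
print(len(docs))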
Bautzen answered 16/3, 2018 at 12:47 Comment(1)
Thanks, I found another reference here: gist.github.com/drorata/146ce50807d16fd4a6aa. Thanks for your help! - Calix

This is probably an Elasticsearch constraint: the index.max_result_window index setting, which defaults to 10,000.
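If you really do need deep from/size pagination rather than scroll or search_after, the limit can be raised per index. A minimal sketch with the elasticsearch Python client (the index name is a placeholder; every from + size window up to this value must be buffered on the coordinating node, so raising it costs memory and CPU):

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Raise the result window for this index so requests up to ~20,000 docs are allowed.
es.indices.put_settings(
    index="your_index_name",
    body={"index": {"max_result_window": 20000}})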
Kazan answered 16/3, 2018 at 12:21 Comment(1)
Got it. So I tried to get the second page as well with search_results = es_repo.search(query, index=advertiser_name, es_from=1, size=10000), but got no response. I got the first page fine, though. - Calix
