How does Elasticsearch delete_by_query work? What happens when we insert new data and retrieve it while documents are being deleted?

I want to know more about Elasticsearch's delete_by_query, its Java high-level API, and whether it's feasible to perform a bulk delete.

Following is the configuration information:

  • Java: 8
  • Elastic Version: 7.1.1
  • Elastic dependencies added:

    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.1.1</version>
    </dependency>
    
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>7.1.1</version>
    </dependency>
    

In my case, around 10K records are added daily to the index dev-answer. I want to trigger a delete operation (this can be triggered daily, weekly, or monthly) which will delete all documents from the above index that satisfy a specific condition (which I'll give in the DeleteByQueryRequest).

For deletion there is an API, as given in the latest documentation I'm referring to:

DeleteByQueryRequest request = new DeleteByQueryRequest("source1", "source2");
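
For context, this is roughly how I plan to build and execute the request. Assume client is an already-built RestHighLevelClient; the createdOn field and the 30-day cutoff are only placeholders for my actual condition:

    import org.elasticsearch.action.ActionListener;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.reindex.BulkByScrollResponse;
    import org.elasticsearch.index.reindex.DeleteByQueryRequest;

    // Delete from dev-answer every document older than 30 days
    // (placeholder condition - the real one comes from my business logic).
    DeleteByQueryRequest request = new DeleteByQueryRequest("dev-answer");
    request.setQuery(QueryBuilders.rangeQuery("createdOn").lte("now-30d"));

    client.deleteByQueryAsync(request, RequestOptions.DEFAULT,
            new ActionListener<BulkByScrollResponse>() {
                @Override
                public void onResponse(BulkByScrollResponse response) {
                    long deleted = response.getDeleted(); // documents actually removed
                }

                @Override
                public void onFailure(Exception e) {
                    // handle / log the failure
                }
            });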

While reading the documentation I came across the following points which I'm unable to understand.

  1. As in the doc: It's also possible to limit the number of processed documents by setting size. request.setSize(10); What does "processed documents" mean? Will it delete only 10 documents?

  2. What batch size should I set? request.setBatchSize(100); Does its performance depend on how many documents we are going to delete?

     Should I first make a call to get the number of documents, and change the setBatchSize value based on that?

  3. request.setSlices(2); Should the number of slices depend on how many cores the executing machine has, or on the number of cores in the Elasticsearch cluster?

  4. The documentation shows the method setSlices(2), which I'm unable to find in the class org.elasticsearch.index.reindex.DeleteByQueryRequest. What am I missing here?

  5. Let's say I'm executing this delete query in async mode and it takes 0.5-1.0 seconds. If I do a get request on this index in the meantime, will it throw an exception? Also, if I insert a new document at the same time and then retrieve it, will I get a response?

Charlenacharlene answered 5/7, 2019 at 9:30 Comment(7)
Very interesting question. I understand you are asking about the _delete_by_query endpoint, not the _bulk endpoint? If so, can you rename the question to avoid misunderstanding, because _bulk also allows deleting documents. – Cranio
I'm a bit confused. It's definitely delete by query, but in the Java API I'm going to use the function public final void deleteByQueryAsync(DeleteByQueryRequest deleteByQueryRequest, RequestOptions options, ActionListener<BulkByScrollResponse> listener) from the class org.elasticsearch.client.RestHighLevelClient. So is it going to make a bulk request or a delete by query? Also, for deleting more than 10K records (in a few cases close to 1K), which will be better: delete_by_query or _bulk? – Charlenacharlene
And there is one more function for a sync request, deleteByQuery, which returns BulkByScrollResponse, so the confusion arose over whether it's a _bulk delete or a delete_by_query. – Charlenacharlene
I understand your confusion. The _delete_by_query endpoint internally performs bulk requests to delete documents efficiently, but they are definitely different endpoints. – Cranio
Ok, got it. I've modified the question as suggested. Thanks. – Charlenacharlene
Thanks, but can you write delete_by_query instead of delete in the title to avoid confusion and help other users find this question? delete is also another endpoint. I'm writing a complete answer to your question. – Cranio
Sure, I'll do that. – Charlenacharlene

1. As in the doc: It's also possible to limit the number of processed documents by setting size. request.setSize(10); What does "processed documents" mean? Will it delete only 10 documents?

If you have not already, you should read the search/_scroll documentation. _delete_by_query performs a scroll search using the query given as a parameter.

The size parameter corresponds to the number of documents returned by each call to the scroll endpoint. If you have 10 documents matching your query and a size of 2, Elasticsearch will internally perform 5 search/_scroll calls (i.e., 5 batches), while if you set a size of 5, only 2 search/_scroll calls will be performed.

Regardless of the size parameter, all documents matching the query will be removed; the operation will just be more or less efficient.

2. What batch size should I set? request.setBatchSize(100); Does its performance depend on how many documents we are going to delete?

The setBatchSize() method is equivalent to setting the size parameter in the query. You can read this article to determine the correct value for the size parameter.
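
For instance (the index name is taken from the question; the value 1000 and the match-all query are only illustrative):

    DeleteByQueryRequest request = new DeleteByQueryRequest("dev-answer");
    request.setQuery(QueryBuilders.matchAllQuery());
    // Each internal scroll call fetches up to 1000 matching documents, which are
    // then removed with a bulk request before the next batch is fetched.
    request.setBatchSize(1000);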

3. Should I first make a call to get the number of documents, and change the setBatchSize value based on that?

You would have to run the search request twice to get the number of documents to be deleted, which I believe would not be efficient. I advise you to find a constant value and stick to it.

4. Should the number of slices depend on how many cores the executing machine has, or on the number of cores in the Elasticsearch cluster?

The number of slices should be set based on the Elasticsearch cluster configuration. It allows the search to be parallelized, both between shards and within a shard.

You can read the documentation for hints on how to set this parameter; usually it is the number of shards of your index.

5. The documentation shows the method setSlices(2), which I'm unable to find in the class org.elasticsearch.index.reindex.DeleteByQueryRequest. What am I missing here?

You are right, that is probably an error in the documentation. I have never tried it, but I believe you should use forSlice(TaskId slicingTask, SearchRequest slice, int totalSlices).
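
If it really is missing from the client jar you are using, one workaround (a sketch I have not tested; the query is only illustrative) is to go through the low-level client and pass slices as a URL parameter:

    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;

    // Issue _delete_by_query directly, passing the number of slices as a parameter.
    Request lowLevelRequest = new Request("POST", "/dev-answer/_delete_by_query");
    lowLevelRequest.addParameter("slices", "2"); // e.g. one slice per shard
    lowLevelRequest.setJsonEntity("{\"query\": {\"match_all\": {}}}");
    Response response = client.getLowLevelClient().performRequest(lowLevelRequest);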

6. Let's say I'm executing this delete query in async mode and it takes 0.5-1.0 seconds. If I do a get request on this index in the meantime, will it throw an exception? Also, if I insert a new document at the same time and then retrieve it, will I get a response?

First, as stated in the documentation, the _delete_by_query endpoint creates a snapshot of the index and works on this copy.

For a get request, it depends on whether the document has already been deleted or not. It will never throw an exception; you will just get the same result as if you were retrieving an existing or a non-existing document. Please note that unless you specify a sort in the search query, the order in which documents are deleted is not determined.

If you insert (or update) a document during the processing, this document will not be taken into account by the _delete_by_query endpoint, even if it matches the _delete_by_query query. This is where the snapshot is used. So if you insert a new document, you will be able to retrieve it. The same applies if you update an existing document: it will be created again if it has already been deleted, or updated (but not deleted) if it has not been deleted yet.
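
A sketch of that scenario (assuming deleteRequest and listener are the DeleteByQueryRequest and ActionListener from the question, client is a RestHighLevelClient, and the document id and fields are made up):

    import org.elasticsearch.action.get.GetRequest;
    import org.elasticsearch.action.get.GetResponse;
    import org.elasticsearch.action.index.IndexRequest;

    // Start the delete asynchronously; it works on a snapshot of the index.
    client.deleteByQueryAsync(deleteRequest, RequestOptions.DEFAULT, listener);

    // Index a new document while the delete is still running...
    IndexRequest indexRequest = new IndexRequest("dev-answer")
            .id("answer-42")
            .source("text", "new answer", "createdOn", "2019-07-05");
    client.index(indexRequest, RequestOptions.DEFAULT);

    // ...and read it back: it is not part of the snapshot, so it is not deleted,
    // and the get simply returns it (no exception in either case).
    GetResponse get = client.get(new GetRequest("dev-answer", "answer-42"), RequestOptions.DEFAULT);
    boolean found = get.isExists();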

As a side note, deleted documents will still be searchable (even after the delete_by_query task has finished) until a refresh operation has occurred.

_delete_by_query does not support the refresh parameter. The "request returns" mentioned in the documentation for the refresh operation refers to requests that can have a refresh parameter. If you want to force a refresh you can use the _refresh endpoint. By default, a refresh operation occurs every 1 second, so at most 1 second after the _delete_by_query operation has finished, the deleted documents will no longer be searchable.
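
If you need the deleted documents to disappear from search results immediately, you can trigger the refresh yourself; a sketch with the high-level client (index name from the question):

    import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;

    // Force a refresh so the deletions become visible to searches right away,
    // instead of waiting for the periodic (1 s by default) refresh.
    client.indices().refresh(new RefreshRequest("dev-answer"), RequestOptions.DEFAULT);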

Cranio answered 5/7, 2019 at 11:10 Comment(8)
Have you thought about using daily indices? Then you could just drop the entire index, which is much cheaper since you'll avoid all the merging that happens after big deletes. That approach would be more performant (at least for the deletion) and also easier to implement. There might be tradeoffs around oversharding though. – Camerlengo
Thank you so much for the detailed answer and documentation link. It cleared up a lot of concepts. It makes sense that the batch size should depend on the size of the data being returned. So this has nothing to do with the scroll limit, correct? In delete_by_query, is the batch limit used for copying data from the current index to its snapshot? – Charlenacharlene
As I'm not setting any refresh value in this case, it means it will be false (the default), correct? But what does this mean: "The changes made by this request will be made visible at some point after the request returns" (from the refresh doc)? What does "request returns" mean here? – Charlenacharlene
@Camerlengo Yes, that option also makes sense. But currently I'm not thinking of daily indices unless the delete operation has some fixed periodicity. Just curious to know what tradeoffs there would be. – Charlenacharlene
I have updated the answer with a response to your question about the refresh operation. I am not sure I understand your question about batch size and scroll limit. What do you mean by scroll limit? – Cranio
@Pierre-NicolasMougel By scroll limit I'm referring to the "size": 100 given when calling the _search?scroll API. – Charlenacharlene
Ok, then the batch size is the scroll limit. – Cranio
Downsides of daily indices: you'll probably have more shards (probably 1 per day). There's a certain overhead for every shard (so keep the count low), and your search operations will need to search more shards (probably through an index alias like myindex-*). – Camerlengo
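
For reference, with daily indices the cleanup suggested in the comments above would amount to dropping a whole index; a sketch with a made-up daily index name:

    import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;

    // Drop an expired daily index instead of running _delete_by_query on its documents.
    client.indices().delete(new DeleteIndexRequest("dev-answer-2019.06.05"), RequestOptions.DEFAULT);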
