What is the recommended way to delete a large number of items from DynamoDB?

162

I'm writing a simple logging service in DynamoDB.

I have a logs table that is keyed by a user_id hash and a timestamp (Unix epoch int) range.

When a user of the service terminates their account, I need to delete all items in the table, regardless of the range value.

What is the recommended way of doing this sort of operation (keeping in mind there could be millions of items to delete)?

My options, as far as I can see, are:

A: Perform a Scan operation, calling delete on each returned item, until no items are left

B: Perform a BatchGet operation, again calling delete on each item until none are left

Both of these look terrible to me as they will take a long time.

What I ideally want to do is call LogTable.DeleteItem(user_id) - Without supplying the range, and have it delete everything for me.

Romanaromanas answered 6/2, 2012 at 0:0 Comment(0)
80

What I ideally want to do is call LogTable.DeleteItem(user_id) - Without supplying the range, and have it delete everything for me.

An understandable request indeed; I can imagine advanced operations like these might get added over time by the AWS team (they have a history of starting with a limited feature set first and evaluating extensions based on customer feedback), but here is what you should do to avoid the cost of a full scan, at least:

  1. Use Query rather than Scan to retrieve all items for user_id - this works regardless of the combined hash/range primary key in use, because HashKeyValue and RangeKeyCondition are separate parameters in this API and the former only targets the attribute value of the hash component of the composite primary key.
  • Please note that you'll have to deal with the query API paging here as usual; see the ExclusiveStartKey parameter (and the sketch after this list):

Primary key of the item from which to continue an earlier query. An earlier query might provide this value as the LastEvaluatedKey if that query operation was interrupted before completing the query; either because of the result set size or the Limit parameter. The LastEvaluatedKey can be passed back in a new query request to continue the operation from that point.

  2. Loop over all returned items and call DeleteItem on each as usual.
  • Update: Most likely BatchWriteItem is more appropriate for a use case like this (see below for details).
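By way of illustration, here is a minimal Python (boto3) sketch of the paged query, assuming a table named logs with a user_id hash key as in the question; it uses the current KeyConditionExpression parameter rather than the legacy HashKeyValue/RangeKeyCondition parameters mentioned above, and pages via LastEvaluatedKey/ExclusiveStartKey:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("logs")  # hypothetical table name

def query_all_items(user_id):
    """Yield every item for user_id, following LastEvaluatedKey paging."""
    kwargs = {"KeyConditionExpression": Key("user_id").eq(user_id)}
    while True:
        page = table.query(**kwargs)
        yield from page["Items"]
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]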

Update

As highlighted by ivant, the BatchWriteItem operation enables you to put or delete several items across multiple tables in a single API call [emphasis mine]:

To upload one item, you can use the PutItem API and to delete one item, you can use the DeleteItem API. However, when you want to upload or delete large amounts of data, such as uploading large amounts of data from Amazon Elastic MapReduce (EMR) or migrate data from another database in to Amazon DynamoDB, this API offers an efficient alternative.

Please note that this still has some relevant limitations, most notably:

  • Maximum operations in a single request — You can specify a total of up to 25 put or delete operations; however, the total request size cannot exceed 1 MB (the HTTP payload).

  • Not an atomic operation — Individual operations specified in a BatchWriteItem are atomic; however BatchWriteItem as a whole is a "best-effort" operation and not an atomic operation. That is, in a BatchWriteItem request, some operations might succeed and others might fail. [...]

Nevertheless, this obviously offers a potentially significant gain for use cases like the one at hand.
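As one concrete (not the only) way to do this, the boto3 batch_writer mentioned in the comments below wraps BatchWriteItem, groups delete requests into batches of 25, and resends unprocessed items automatically. A minimal sketch, assuming the question's schema (table name and attribute names are otherwise assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("logs")  # hypothetical table name

def delete_all_logs_for_user(user_id):
    """Query every item for user_id and delete it via BatchWriteItem."""
    query_kwargs = {
        "KeyConditionExpression": Key("user_id").eq(user_id),
        # Fetch only the key attributes; full items are not needed for deletion.
        "ProjectionExpression": "user_id, #ts",
        "ExpressionAttributeNames": {"#ts": "timestamp"},  # timestamp is a reserved word
    }
    with table.batch_writer() as batch:
        while True:
            page = table.query(**query_kwargs)
            for item in page["Items"]:
                batch.delete_item(Key={"user_id": item["user_id"], "timestamp": item["timestamp"]})
            if "LastEvaluatedKey" not in page:
                break
            query_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]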

Dextrorotation answered 6/2, 2012 at 11:30 Comment(6)
I think it would make sense to use batch delete for the second step (it's "masked" as a batch write operation)Kile
@Kile - thanks much for the hint, this "masked" delete functionality of BatchWriteItem indeed escaped me back then; I've updated the answer accordingly.Dextrorotation
For deleting with BatchWriteItem, items need to be specified via TableWriteItemsSettee
The link to BatchWriteItem is now docs.aws.amazon.com/amazondynamodb/latest/APIReference/…Squad
I realise this is old, and the OP didn't mention a specific language SDK, but in Python there is a high level batch_writer() as part of the boto3.resource.Table API that will "automatically handle buffering and sending items in batches. In addition, the batch writer will also automatically handle any unprocessed items and resend them as needed" i.e. it's a wrapper around BatchWriteItem that manages the annoying parts. boto3.amazonaws.com/v1/documentation/api/latest/reference/…Lancewood
For Python, Boto3's batch_writer is the way to go for batch deletes. I put a working example on GitHub https://github.com/awsdocs/aws-doc-sdk-examples.Lemma
53

According to the DynamoDB documentation you could just delete the full table.

See below:

"Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations"

If you wish to delete only a subset of your data, then you could make separate tables for each month, year or similar. This way you could remove "last month" and keep the rest of your data intact.

This is how you delete a table in Java using the AWS SDK:

DeleteTableRequest deleteTableRequest = new DeleteTableRequest()
  .withTableName(tableName);
DeleteTableResult result = client.deleteTable(deleteTableRequest);
Gaunt answered 15/4, 2013 at 9:51 Comment(9)
I like this answer too, but a caution: this could create many tables in your system, and you pay for each table's provisioned capacity. So you need to reduce the provisioning after the end of the month (if your tables are per month) while that table is not yet deleted.Silvio
This answer is also very nice considering that storing 'dead' rows in dynamo can only make your read throughput worse (it might not, but it can't help). Making a single 'hot' table with a bunch of old 'cold' tables with lower throughput is likely cheaper in the long run than keeping everything in the same table.Chas
I agree with this answer; it applies if you need to delete all records from the table, but here the questioner wants to delete the user's entries, not the whole table.Grosso
Having a separate table for each user would be expensive given DynamoDB pricing. One table per month would actually make things worse. This is clearly an answer for a different, very specific problem.Leannaleanne
Deleting the table may also not be an attractive option if you use automated provisioning such as CloudFormation to manage your table as part of a stack. I'm not aware of a simple way to make CloudFormation recreate a table that you deleted by hand.Trespass
This approach takes quite a bit of time to delete and recreate (when needed) the table, making it unavailable during the whole time. The question clearly states removing user data, which would be impractical to split into separate, per-user tables.Leannaleanne
I believe deleting and recreating a table would stop the monitoring of changes by any Lambdas (or other systems) that are using the table ARN.Shrubby
@SergioMCFigueiredo This can be easily taken care of by enabling auto scaling of read and write capacity.Tribulation
Deleting the whole table is like hitting a fly with a sledgehammer; how about the 90% use case, which is deleting all items with a given hash key but leaving the rest of the items untouched?Curator
25

If you want to delete items after some time, e.g. after a month, just use the Time To Live option. It does not consume write units.

In your case, I would add a TTL attribute for when logs expire and leave the items in place after a user is deleted. TTL would make sure the logs are removed eventually.

When Time To Live is enabled on a table, a background job checks the TTL attribute of items to see if they are expired.

DynamoDB typically deletes expired items within 48 hours of expiration. The exact duration within which an item truly gets deleted after expiration is specific to the nature of the workload and the size of the table. Items that have expired and not been deleted will still show up in reads, queries, and scans. These items can still be updated and successful updates to change or remove the expiration attribute will be honored.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html
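A minimal boto3 sketch of the TTL approach, assuming an epoch-seconds attribute named expires_at (the attribute name, table name, and item values are hypothetical):

import time
import boto3

client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table("logs")  # hypothetical table name

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
client.update_time_to_live(
    TableName="logs",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Include the expiry (epoch seconds) when writing each log item.
now = int(time.time())
table.put_item(
    Item={
        "user_id": "user-123",
        "timestamp": now,
        "message": "example log line",
        "expires_at": now + 30 * 24 * 3600,  # expire after roughly 30 days
    }
)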

Alber answered 13/6, 2018 at 20:59 Comment(3)
adding TTL is an "update" (write operation). I'm not sure there is any gain for doing an "update" instead of a "delete".Photographic
You can have that attribute inserted with the original write and updated with any other update action. Of course, it's not an option if you already have a bunch of data that you then want to delete. But it is a valid option for cases where you can set a TTL on the data you insert or update.Alber
I agree, if there is already TTL configured and cleanup can wait up to 48 hours that's definitely the optimal option. My apologies if I was unclear.Photographic
7

The answer to this question depends on the number of items, their size, and your budget. Depending on that, there are three cases:

1- The number of items and the size of the items in the table are not very large. Then, as Steffen Opel said, you can use Query rather than Scan to retrieve all items for user_id and loop over the returned items with either DeleteItem or BatchWriteItem. But keep in mind you may burn a lot of throughput capacity here. For example, consider a situation where you need to delete 1,000 items from a DynamoDB table. Assume each item is 1 KB in size, resulting in around 1 MB of data. The deletes alone consume roughly 1,000 write capacity units (the query consumes read capacity on top of that). To perform this bulk delete within 10 seconds (which is not even considered fast in some applications), you would need to set the provisioned write throughput of the table to roughly 100 write capacity units. As you can see, this approach is workable only for a small number of items or small items.

2- We have a lot of items, or very large items, and we can store them according to time in different tables. Then, as Jonathan said, you can just delete the table. This is much better, but I don't think it matches your case: you want to delete all of a user's data regardless of when the logs were created, so you can't delete one particular table. And if you wanted a separate table per user, then with a high number of users it becomes expensive and impractical.

3- If you have a lot of data, you can't divide your hot and cold data into different tables, and you need to do large-scale deletes frequently, then unfortunately DynamoDB is not a good option for you at all. It may become expensive or very slow (depending on your budget). In these cases, I recommend finding another database for your data.

Subsocial answered 26/5, 2016 at 7:16 Comment(0)
3

There is no option to truncate DynamoDB tables; you have to drop the table and create it again. DynamoDB charges are based on ReadCapacityUnits and WriteCapacityUnits. Deleting all items with the BatchWriteItem function consumes WriteCapacityUnits, so it is better either to delete only specific records or to delete the table and start again.
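For completeness, a minimal boto3 sketch of the drop-and-recreate approach, assuming the logs table and key schema described in the question (the table name, attribute types, and billing mode are assumptions):

import boto3

client = boto3.client("dynamodb")
TABLE_NAME = "logs"  # hypothetical table name

# Drop the table and wait for the deletion to finish.
client.delete_table(TableName=TABLE_NAME)
client.get_waiter("table_not_exists").wait(TableName=TABLE_NAME)

# Recreate it with the same key schema.
client.create_table(
    TableName=TABLE_NAME,
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
client.get_waiter("table_exists").wait(TableName=TABLE_NAME)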

Margitmargo answered 6/6, 2018 at 15:7 Comment(0)
3

Just an update: there is a release in the DynamoDB console which includes a new feature called the PartiQL editor. It's a SQL-like editor for DynamoDB operations.

Delete Specific Record

DELETE FROM <Table-Name> WHERE id=some-Id;

Con: Can only delete one item at a time
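The same statement can also be issued programmatically; here is a minimal boto3 sketch using execute_statement, with hypothetical table and key names (note that the full primary key must appear in the WHERE clause):

import boto3

client = boto3.client("dynamodb")

# Delete a single item by its full primary key via PartiQL.
client.execute_statement(
    Statement='DELETE FROM "logs" WHERE "user_id" = ? AND "timestamp" = ?',
    Parameters=[{"S": "user-123"}, {"N": "1328486400"}],
)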

Chisel answered 26/7, 2021 at 15:34 Comment(1)
(((Con: Can only delete one item at a time))) i lol'dFrieder
3

Here's a recursive function I'm using to delete all items with batchWrite (the DocumentClient wrapper around BatchWriteItem). Define your table's name and primary key attribute, then call clearTable (this assumes a simple primary key with no sort key):

var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

const TABLE_NAME = ""
const TABLE_PRIMARY_KEY = ""

const clearTable = async () => {

    const batch = await getItemBatch();

    await recursiveDeleteTableItems(batch)

}

const recursiveDeleteTableItems = async (batch) => {

    if(batch && batch.length > 0) {
      await deleteItemBatch(batch)
    } else {
      return
    }

    const newItemBatch = await getItemBatch()

    await recursiveDeleteTableItems(newItemBatch)

}

const deleteItemBatch = async (batch) => {

   // Build one DeleteRequest per item, keyed by the value of the table's primary key attribute.
   const deleteOperations = batch.map( i => ({ 
     "DeleteRequest": { 
       "Key": { 
        [TABLE_PRIMARY_KEY] : i[TABLE_PRIMARY_KEY]
       }
     }
   }))

   return new Promise(async (resolve, reject) => {

     const params = {
       "RequestItems": {
         [TABLE_NAME]: deleteOperations
       }
     }

     docClient.batchWrite(params, (err, data) => {

       if (err) {
         reject(`Unable to query. Error: ${err} ${JSON.stringify(err, null, 2)}`);
         return
       }

       resolve(data)

      })

    })

}

const getItemBatch = async () => {

  var params = {
    TableName: TABLE_NAME,
    Limit: 25 // match batchWriteItem
  };

  return new Promise(async (resolve, reject) => {

    docClient.scan(params, async function (err, data) {

        if (err) {
            reject(`Unable to query. Error: ${err} ${JSON.stringify(err, null, 2)}`);
            return
        }

        resolve(data.Items)

      });
  });

}
Antichrist answered 17/12, 2021 at 11:38 Comment(0)
1

My approach to deleting all rows from a table in DynamoDB is simply to pull all rows out of the table, using DynamoDB's ScanAsync, and then feed the result list to DynamoDB's AddDeleteItems. The C# code below works fine for me.

    public async Task DeleteAllReadModelEntitiesInTable()
    {
        List<ReadModelEntity> readModels;

        var conditions = new List<ScanCondition>();
        readModels = await _context.ScanAsync<ReadModelEntity>(conditions).GetRemainingAsync();

        var batchWork = _context.CreateBatchWrite<ReadModelEntity>();
        batchWork.AddDeleteItems(readModels);
        await batchWork.ExecuteAsync();
    }

Note: Deleting the table and then recreating it again from the web console may cause problems if using YAML/CloudFormation to create the table.

Calv answered 23/1, 2019 at 14:32 Comment(0)
1

Have you thought about using a test event to pass in the variables?

Test input would be something like:

{
  "TABLE_NAME": "MyDevTable",
  "PARTITION_KEY": "REGION",
  "SORT_KEY": "COUNTRY"
}

Adjusted your code to accept the inputs:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient({ apiVersion: '2012-08-10' });

exports.handler = async (event) => {
    const TABLE_NAME = event.TABLE_NAME;
    const PARTITION_KEY = event.PARTITION_KEY;
    const SORT_KEY = event.SORT_KEY;
    let params = {
        TableName: TABLE_NAME,
    };
    console.log(`keys: ${PARTITION_KEY} ${SORT_KEY}`);

    let items = [];
    let data = await docClient.scan(params).promise();
    items = [...items, ...data.Items];
    
    while (typeof data.LastEvaluatedKey != 'undefined') {
        params.ExclusiveStartKey = data.LastEvaluatedKey;

        data = await docClient.scan(params).promise();
        items = [...items, ...data.Items];
    }

    let leftItems = items.length;
    let group = [];
    let groupNumber = 0;

    console.log('Total items to be deleted', leftItems);

    for (const i of items) {
        // console.log(`item: ${i[PARTITION_KEY] } ${i[SORT_KEY]}`);
        const deleteReq = {DeleteRequest: {Key: {},},};
        deleteReq.DeleteRequest.Key[PARTITION_KEY] = i[PARTITION_KEY];
        deleteReq.DeleteRequest.Key[SORT_KEY] = i[SORT_KEY];

        // console.log(`DeleteRequest: ${JSON.stringify(deleteReq)}`);
        group.push(deleteReq);
        leftItems--;

        if (group.length === 25 || leftItems < 1) {
            groupNumber++;

            console.log(`Batch ${groupNumber} to be deleted.`);

            const params = {
                RequestItems: {
                    [TABLE_NAME]: group,
                },
            };

            await docClient.batchWrite(params).promise();

            console.log(
                `Batch ${groupNumber} processed. Left items: ${leftItems}`
            );

            // reset
            group = [];
        }
    }

    const response = {
        statusCode: 200,
        //  Uncomment below to enable CORS requests
        headers: {
            "Access-Control-Allow-Origin": "*"
        },
        body: JSON.stringify('Hello from Lambda!'),
    };
    return response;
};
Marston answered 27/11, 2020 at 8:25 Comment(0)
