AWS glue delete all partitions
Asked Answered
Y

4

5

I defined several tables in AWS glue.

Over the past few weeks, I've had different issues with the table definition which I had to fix manually - I want to change column names, or types, or change the serialization lib. However, If i already have partitions created, the repairing of table doesn't change them, and so I have to delete all partitions manually and then repairing.

Is there a simple way to do this? Delete all partitions from an AWS Glue table? I'm using aws batch-delete-partition CLI command, but it's syntax is tricky, and there are some limitations on the amount of partitions you can delete in one go, the whole thing is cumbersome...

Yeung answered 30/3, 2020 at 9:35 Comment(2)
using boto3 with python might be the more elegant solution you are looking for, any reason for sticking with aws cli?Acidic
@Acidic No reason really... I really expected this to be an easy thing to do, by the AWS CLI or otherwise, without having to write code. I'd rather not write python scripts if I can avoid it. If you can post your python code it'll helpYeung
Y
16

For now, I found this command line solution, runinng aws glue batch-delete-partition iteratively for batches of 25 partitions using xargs

(here I am assuming there are max 1000 partitions):

aws glue get-partitions --database-name=<my-database> --table-name=<my-table> | jq -cr '[ { Values: .Partitions[].Values } ]' > partitions.json

seq 0 25 1000 | xargs -I _ bash -c "cat partitions.json | jq -c '.[_:_+25]'" | while read X; do aws glue batch-delete-partition --database-name=<my-database> --table-name=<my-table > --partitions-to-delete=$X; done

Hope it helps someone, but I'd prefer a more elegant solution

Yeung answered 30/3, 2020 at 11:32 Comment(1)
Very useful. To process more than 1k partitions simply bump up the 1000 in the second line of the script.Whop
A
10

Using python3 with boto3 looks a little bit nicer. Albeit not by much :)

Unfortunately AWS doesn't provide a way to delete all partitions without batching 25 requests at a time. Note that this will only work for deleting the first page of partitions retrieved.

import boto3

glue_client = boto3.client("glue", "us-west-2")

def get_and_delete_partitions(database, table, batch=25):
    partitions = glue_client.get_partitions(
        DatabaseName=database,
        TableName=table)["Partitions"]

    for i in range(0, len(partitions), batch):
        to_delete = [{k:v[k]} for k,v in zip(["Values"]*batch, partitions[i:i+batch])]
        glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete)

EDIT: To delete all partitions (beyond just the first page) using paginators makes it look cleaner.

import boto3

glue_client = boto3.client("glue", "us-west-2")

def delete_partitions(database, table, partitions, batch=25):
    for i in range(0, len(partitions), batch):
      to_delete = [{k:v[k]} for k,v in zip(["Values"]*batch, partitions[i:i+batch])]
      glue_client.batch_delete_partition(
        DatabaseName=database,
        TableName=table,
        PartitionsToDelete=to_delete)

def get_and_delete_partitions(database, table):
    paginator = glue_client.get_paginator('get_partitions')
    itr = paginator.paginate(DatabaseName=database, TableName=table)
    
    for page in itr:
      delete_partitions(database, table, page["Partitions"])
Acidic answered 2/4, 2020 at 19:4 Comment(1)
worked beautifully! thank youJackofalltrades
P
2

Here is a PowerShell version FWIW:

$database = 'your db name'
$table = 'your table name'
# Set the variables above

$batch_size = 25
Set-DefaultAWSRegion -Region eu-west-2
$partition_list = Get-GLUEPartitionList -DatabaseName $database -TableName $table

$selected_partitions = $partition_list
# Uncomment and edit predicate to select only certain partitions
# $selected_partitions = $partition_list | Where-Object {$_.Values[0] -gt '2020-07-20'} 

$selected_values = $selected_partitions | Select-Object -Property Values

for ($i = 0; $i -lt $selected_values.Count; $i += $batch_size) {
    $chunk = $selected_values[$i..($i + $batch_size - 1)]
    Remove-GLUEPartitionBatch -DatabaseName $database -TableName $table -PartitionsToDelete $chunk -Force
}

# Now run `MSCK REPAIR TABLE db_name.table_name` to add the partitions again
Paisley answered 20/8, 2020 at 22:31 Comment(0)
M
0

Here is a javascript version designed to run in a Lambda. Tested and functional in Lambda Node v16.

const AWS = require('aws-sdk');

// Set the region and initialize the Glue client
AWS.config.update({ region: 'us-east-1' });
const glue = new AWS.Glue();

exports.handler = async (event, context) => {

    // Specify the database and table names
    const databaseName = event.database;
    const tableName = event.table;

    console.log("database:", databaseName);
    console.log("table:", tableName);

    
    async function getAllPartitions(databaseName, tableName) {
        let partitions = [];
        let nextToken;

        do {
            const params = {
                DatabaseName: databaseName,
                TableName: tableName,
                NextToken: nextToken,
            };

            const response = await glue.getPartitions(params).promise();
            partitions = partitions.concat(response.Partitions || []);
            nextToken = response.NextToken;
        } while (nextToken);

        return partitions;
    }

    try {
        // Get a list of partitions for the table
        const partitions = await getAllPartitions(databaseName, tableName);
        console.log(`retrieved ${partitions.length} partitions`);

        // Delete all partitions in batches of 25
        const partitionBatches = [];
        for (let i = 0; i < partitions.length; i += 25) {
            partitionBatches.push(partitions.slice(i, i + 25));
        }

        await Promise.all(partitionBatches.map((partitionBatch) => {

            let partitionValuesArray = [];
            partitionBatch.forEach((partition => {
                partitionValuesArray.push({ Values: partition.Values });
            }));

            glue.batchDeletePartition({
                DatabaseName: databaseName,
                TableName: tableName,
                PartitionsToDelete: partitionValuesArray,
            }).promise();
        }));

        console.log(`Deleted ${partitions.length} partitions`);
    }
    catch (err) {
        console.error(err);
    }

};

Just swap out the event variables with the names manually, or set up test events to change targets. Or can be triggered programmatically however you wish.

Recommend increasing default timeout for large datasets. IAM role for Lambda requires Glue permissions.

Mincey answered 1/5, 2023 at 22:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.