SQL Batched Delete

9

11

I have a table in SQL Server 2005 which has approximately 4 billion rows in it. I need to delete approximately 2 billion of these rows. If I try to do it in a single transaction, the transaction log fills up and it fails. I don't have any extra space to make the transaction log bigger. I assume the best way forward is to batch up the delete statements (in batches of ~10,000?).

I can probably do this using a cursor, but is there a standard/easy/clever way of doing this?

P.S. This table does not have an identity column as a PK. The PK is made up of an integer foreign key and a date.

Coplanar answered 22/5, 2009 at 8:8 Comment(1)
Mmm, sounds like historic/periodic data...Shush
S
11

You can 'nibble' the deletes, which also means that you don't cause a massive load on the database. If your t-log backups run every 10 minutes, then you should be OK to run this once or twice over the same interval. You can schedule it as a SQL Agent job.

Try something like this:

DECLARE @count int
SET @count = 10000

DELETE FROM table1
WHERE table1id IN (
    SELECT TOP (@count) table1id
    FROM table1
    WHERE x = 'y'
)
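
If you would rather run it as one long-running script than as a recurring job, a minimal sketch of the same nibble wrapped in a loop (same placeholder table and column names, with a pause between batches so log maintenance and other work can keep up) might look like:

DECLARE @count int
SET @count = 10000

WHILE 1 = 1
BEGIN
    DELETE FROM table1
    WHERE table1id IN (
        SELECT TOP (@count) table1id
        FROM table1
        WHERE x = 'y'
    )

    IF @@ROWCOUNT = 0 BREAK   -- stop once nothing is left to delete

    WAITFOR DELAY '00:00:05'  -- short pause between batches
END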
Scarabaeus answered 22/5, 2009 at 12:6 Comment(0)
N
8

What distinguishes the rows you want to delete from those you want to keep? Will this work for you:

while exists (select 1 from your_table where <your_condition>)
delete top(10000) from your_table
where <your_condition>
Neves answered 22/5, 2009 at 8:25 Comment(5)
The where condition would basically be: WHERE DateTimeInserted < DATEDIFF(d, GETDATE(), 5). I can give this a try, but I'm concerned that because this will loop ~200,000 times, the select statement will execute 200,000 times and on a 2-4 billion row table it could take up to half an hour (from experience). If this is the case, the statement will take 11 years to run! :)Coplanar
It will still log the deletes, even in batches, filling up the transaction log.Ravens
You can skip the select if you think it's expensive (just replace it with some simpler exit condition). As for transaction log growth, I think you can do some tricks with checkpoints within the loop with "truncate on checkpoint" option turned on.Neves
I could be wrong, but I believe that if you have simple recovery mode, the space in the transaction log will be recovered after each transaction is completed. Therefore I don't think I need to do any truncating/shrinking etc.Coplanar
Rather than checking the IF EXISTS you could use WHILE (1=1) and after the DELETE use IF @@ROWCOUNT = 0 BREAKVirago
V
5

In addition to putting this in a batch with a statement to truncate the log, you also might want to try these tricks:

  • Add criteria that matches the first column in your clustered index in addition to your other criteria
  • If it's possible and won't interfere with anything else going on in the DB, drop the non-clustered indexes from the table and put them back after the delete is done, but KEEP the clustered index (see the sketch at the end of this answer)

For the first point above, for example, if your PK is clustered then find a range which approximately matches the number of rows that you want to delete each batch and use that:

DECLARE @max_id INT, @start_id INT, @end_id INT, @interval INT
SELECT @start_id = MIN(id), @max_id = MAX(id) FROM My_Table
SET @interval = 100000  -- You need to determine the right number here
SET @end_id = @start_id + @interval

WHILE (@start_id <= @max_id)
BEGIN
     DELETE FROM My_Table WHERE id BETWEEN @start_id AND @end_id AND <your criteria>

     SET @start_id = @end_id + 1
     SET @end_id = @end_id + @interval
END
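
A minimal sketch of the second bullet, assuming a hypothetical non-clustered index named IX_My_Table_SomeColumn; disabling the index and rebuilding it afterwards has the same effect as dropping and recreating it, while the clustered index stays in place:

ALTER INDEX IX_My_Table_SomeColumn ON My_Table DISABLE

-- ... run the batched delete from above here ...

ALTER INDEX IX_My_Table_SomeColumn ON My_Table REBUILD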
Virago answered 22/5, 2009 at 12:15 Comment(0)
E
3

Sounds like this is a one-off operation (I hope for you) and you don't need to go back to a state that's halfway through this batched delete - if that's the case, why don't you just switch to SIMPLE recovery mode before running and then back to FULL when you're done?

This way the transaction log won't grow as much. This might not be ideal in most situations but I don't see anything wrong here (assuming as above you don't need to go back to a state that's in between your deletes).

You can do this in your script with something like:

ALTER DATABASE myDB SET RECOVERY SIMPLE   -- before the delete
ALTER DATABASE myDB SET RECOVERY FULL     -- afterwards

Alternatively you can set up a job to shrink the transaction log at a given interval while your delete is running. This is kinda bad but I reckon it'd do the trick.
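
For what it's worth, a minimal sketch of the statement such a job would run, assuming the log file's logical name is myDB_log (check sys.database_files for the real name):

DBCC SHRINKFILE (myDB_log, 1024)   -- attempt to shrink the log file back to 1024 MB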

Elonore answered 22/5, 2009 at 8:18 Comment(4)
Yes, it is a one off operation :) Unfortunately, We're already using simple recovery, but even with simple recovery, the tlog (100GB) fills up when doing the delete in a single transaction.Coplanar
What's worth mentioning here though is that it will invalidate any transactional backups by switching over to simple recovery. If that is not being used, then it's fine (and I actually use that way a lot) but otherwise a full or differential backup is needed afterwards to be able to use the transactional backups again.Nitrobenzene
what about the 'alternative' solution/hack? :)Elonore
I'm not convinced that running an intermitent shrink on the transaction log would help (although I haven't tested it). Space from the transaction log cannot be recovered until after the transaction is completed.Coplanar
S
2

Well, if you were using SQL Server Partitioning, say based on the date column, you would have possibly switched out the partitions that are no longer required. A consideration for a future implementation perhaps.
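
For reference, a minimal sketch of what switching a partition out could look like in such a future implementation, with hypothetical table names; the archive table must have an identical structure and sit on the same filegroup as the partition being switched:

-- Switching a partition is a metadata-only operation, so it is nearly instant
ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Archive

-- The unwanted rows can then be removed cheaply
TRUNCATE TABLE dbo.BigTable_Archive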

I think the best option may be as you say, to delete the data in smaller batches, rather than in one hit, so as to avoid any potential blocking issues.

You could also consider the following method:

  1. Copy the data to keep into a temporary table
  2. Truncate the original table to purge all data
  3. Move everything from the temporary table back into the original table

Your indexes would also be rebuilt as the data was added back to the original table.
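
A minimal sketch of that approach, assuming a hypothetical table dbo.BigTable and using the DateTimeInserted column mentioned in the comments with a hypothetical cut-off:

-- Hypothetical cut-off identifying the rows to keep
DECLARE @cutoff datetime
SET @cutoff = '20090101'

-- 1. Copy the rows to keep into a temporary table
SELECT *
INTO #rows_to_keep
FROM dbo.BigTable
WHERE DateTimeInserted >= @cutoff

-- 2. Purge everything from the original table (TRUNCATE is minimally logged)
TRUNCATE TABLE dbo.BigTable

-- 3. Move the kept rows back
INSERT INTO dbo.BigTable
SELECT * FROM #rows_to_keep

DROP TABLE #rows_to_keep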

Subdivide answered 22/5, 2009 at 8:11 Comment(2)
Thanks for the answer, we have looked into partitioning, but it's not practical for us to implement it at the moment (partially due to this issue: support.microsoft.com/kb/924601). Regarding copying data to a temporary table: would this operation require any less transaction log space than deleting rows?Coplanar
Possibly yes because you would not need to issue a DELETE operation. Once you have created a copy of the table, you would TRUNCATE the source table and then copy only the data you wish to keep back to the source table. I would still recommend that you go with the batch deletion though, as you really want to have all operations logged in order to guarantee the consistency/recoverability of your database.Subdivide
Y
2

I would do something similar to the temp table suggestions but I'd select into a new permanent table the rows you want to keep, drop the original table and then rename the new one. This should have a relatively low tran log impact. Obviously remember to recreate any indexes that are required on the new table after you've renamed it.
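
A minimal sketch of that, with hypothetical object names and keep condition (SELECT INTO is minimally logged under the simple recovery model, which is what keeps the log impact low):

-- Copy only the rows you want to keep into a new permanent table
SELECT *
INTO dbo.BigTable_new
FROM dbo.BigTable
WHERE DateTimeInserted >= '20090101'   -- hypothetical keep condition

-- Swap the tables over
DROP TABLE dbo.BigTable
EXEC sp_rename 'dbo.BigTable_new', 'BigTable'

-- Recreate the primary key and any other required indexes on the renamed table here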

Just my two p'enneth.

Yaw answered 22/5, 2009 at 10:32 Comment(0)
Q
1

Here is my example: it deletes in fixed-size id ranges inside explicit transactions, records the next starting id in an EntryID table so the script can be re-run from where it left off, and stops once a configurable time limit is reached.

-- Configure the script
-- Limits: rows deleted per transaction/commit (default 10,000)
-- and time to allow the script to run (in seconds, default 2 hours)
--
DECLARE @MAX INT
DECLARE @MAXT INT
--
-- $MAX, $MAXT, $TABLE and $WHERE are substituted by the calling shell script;
-- $TABLE and $WHERE are pasted directly into the statements below.
--
SET @MAX = $MAX
SET @MAXT = $MAXT

-- step 1 - Main loop
DECLARE @continue INT
-- deleted in one transaction
DECLARE @deleted INT
-- deleted total in script
DECLARE @total INT
SET @total = 0
DECLARE @max_id INT, @start_id INT, @end_id INT, @interval INT
SET @interval = @MAX
SELECT @start_id = MIN(id), @max_id = MAX(id) FROM $TABLE
SET @end_id = @start_id + @interval

-- timing
DECLARE @start DATETIME
DECLARE @now DATETIME
DECLARE @timee INT
SET @start = GETDATE()
-- 
SET @continue = 1
IF OBJECT_ID (N'EntryID', 'U') IS NULL 
BEGIN
    CREATE TABLE EntryID (startid INT)
    INSERT INTO EntryID(startid) VALUES(@start_id)
END
    ELSE
BEGIN
    SELECT @start_id = startid FROM EntryID
END


WHILE (@continue = 1 AND @start_id <= @max_id)
BEGIN

    PRINT 'Start issued:   ' + CONVERT(varchar(19), GETDATE(), 120)
    BEGIN TRANSACTION
        DELETE
        FROM $TABLE
        WHERE id BETWEEN @start_id AND @end_id AND ($WHERE)
        SET @deleted = @@ROWCOUNT
    UPDATE EntryID SET EntryID.startid = @end_id + 1
    COMMIT
    PRINT 'Deleted issued: ' + STR(@deleted) + ' records. ' + CONVERT(varchar(19), GETDATE(), 120) 
    SET @total = @total + @deleted
    SET @start_id = @end_id + 1
    SET @end_id = @end_id + @interval
    IF @end_id > @max_id
        SET @end_id = @max_id

    SET @now = GETDATE()
    SET @timee = DATEDIFF (second, @start, @now)
    if @timee > @MAXT
    BEGIN
    PRINT 'Time limit exceeded for the script, exiting'
    SET @continue = 0
    END
--    ELSE
--    BEGIN
--      SELECT @total 'Removed now', @timee 'Total time, seconds'   
--    END
END

SELECT @total 'Removed records', @timee 'Total time sec' , @start_id 'Next id', @max_id 'Max id', @continue 'COMPLETED? '
SELECT * from EntryID next_start_id

GO
Quinidine answered 4/9, 2014 at 1:46 Comment(1)
You may want to add some description beyond the code itself.Maramarabel
R
0

The short answer is, you can't delete 2 billion rows without incurring some kind of major database downtime.

Your best option may be to copy the data to a temp table and truncate the original table, but this will fill your tempDB and would use no less logging than deleting the data.

You will need to delete as many rows as you can until the transaction log fills up, then truncate it each time. The answer provided by Stanislav Kniazev could be modified to do this by increasing the batch size and adding a call to truncate the log file.
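
A rough sketch of that per-batch log maintenance, with hypothetical database and path names: under the simple recovery model a CHECKPOINT after each committed batch lets the log space be reused, while under the full recovery model you would take a log backup instead:

-- After each batch of deletes commits:
CHECKPOINT

-- Or, under the full recovery model:
-- BACKUP LOG myDB TO DISK = 'D:\Backups\myDB_log.trn'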

Ravens answered 22/5, 2009 at 8:41 Comment(0)
B
0

I agree with the people who want you to loop over a smaller set of records; this will be faster than trying to do the whole operation in one step. You may need to experiment with the number of records you include in the loop. About 2,000 at a time seems to be the sweet spot in most of the tables I do large deletes from, although a few need smaller amounts, like 500. It depends on the number of foreign keys, the size of the record, triggers, etc., so it really will take some experimenting to find what you need. It also depends on how heavy the use of the table is. A heavily accessed table will need each iteration of the loop to run for a shorter amount of time. If you can run during off hours, or better yet in single-user mode, then you can have more records deleted in one loop.

If you don't think you can do this in one night during off hours, it might be best to design the loop with a counter and only do a set number of iterations each night until it is done (see the sketch below).
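
A minimal sketch of that counter idea, with placeholder names in the style of the earlier answers:

DECLARE @iterations int, @max_iterations int
SET @iterations = 0
SET @max_iterations = 500        -- tune to what fits in one night's window

WHILE @iterations < @max_iterations
BEGIN
    DELETE TOP (2000) FROM your_table
    WHERE <your_condition>

    IF @@ROWCOUNT = 0 BREAK      -- finished early

    SET @iterations = @iterations + 1
END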

Further, if you use an implicit transaction rather than an explicit one, you can kill the loop query at any time, and records already deleted will stay deleted, except those in the current round of the loop. Much faster than trying to roll back half a million records because you've brought the system to a halt.

It is usually a good idea to back up the database immediately before undertaking an operation of this nature.

Bianchi answered 22/5, 2009 at 14:1 Comment(0)
