MySQL - Big DELETEs on multiple tables
Asked Answered
G

5

6

I have 7 related tables and on one of the tables, there is a timestamp column and I want to delete all rows older than 30 days. However, these are VERY big deletes. I'm talking tens of millions of records. If I delete all these records from the main table, I have to look at the other 6 tables and delete the associated records from those tables as well.

My question is what is the best way to optimize this?

I'm thinking of using PARTITION but only one table has the timestamp column. I'm worried that if I drop the old partition in the main table, the related records will still exist in the other 6 tables. The related records are related by the fields (sid, cid).

For context I'm using snort and barnyard which are IDS processors.

I'm using MySQL 5.1.73, MyISAM tables

Here is a snippet from cleanup logs:

StartTime,EndTime,TimeElapsed,AffectedRows
Wed Jan 6 01:00:01 EST 2016,Wed Jan 6 01:45:11 EST 2016,45:10,2911807
Thu Jan 7 01:00:02 EST 2016,Thu Jan 7 01:25:29 EST 2016,25:27,2230255
Fri Jan 8 01:00:01 EST 2016,Fri Jan 8 01:24:18 EST 2016,24:17,1400470
Sat Jan 9 01:00:02 EST 2016,Sat Jan 9 05:47:10 EST 2016,287:8,23360088
Sun Jan 10 01:00:01 EST 2016,Sun Jan 10 10:06:16 EST 2016,546:15,44970072
Mon Jan 11 01:00:01 EST 2016,Mon Jan 11 09:40:39 EST 2016,520:38,43948091

This was my old cleanup script:

/usr/bin/mysql --defaults-extra-file=/old/.my.cnf snort_db >> /root/snortcleaner.log 2>&1 <<EOF
use snort_db;

DROP TRIGGER IF EXISTS delete_old;

DELIMITER //
CREATE TRIGGER delete_old AFTER DELETE ON event
FOR EACH ROW
BEGIN
DELETE FROM data WHERE data.cid = old.cid AND data.sid = old.sid;
DELETE FROM iphdr WHERE iphdr.cid = old.cid AND iphdr.sid = old.sid;
DELETE FROM icmphdr WHERE icmphdr.cid = old.cid AND icmphdr.sid = old.sid;
DELETE FROM tcphdr WHERE tcphdr.cid = old.cid AND tcphdr.sid = old.sid;
DELETE FROM udphdr WHERE udphdr.cid = old.cid AND udphdr.sid = old.sid;
DELETE FROM opt WHERE opt.cid = old.cid AND opt.sid = old.sid;
END //
DELIMITER ;

EOF

# Send the main MySQL command: Deletes all records betweeen the oldest     timestamp and 31 days from now()
# Gets the oldest timestamp and ranges a deletion from that to 31 days before now(). If the oldest timestamp is more recent than 31 days, the following command returns 0 anyway. If it is older than 31 days, it will return them
OLDEST_TIMESTAMP=$(mysql --defaults-extra-file=/old/.my.cnf -Dsnort_db -se "SELECT timestamp FROM event ORDER BY timestamp ASC LIMIT 1;")
NUM_AFFECTED=$(mysql --defaults-extra-file=/old/.my.cnf -Dsnort_db -se "DELETE FROM event WHERE timestamp BETWEEN DATE_SUB('${OLDEST_TIMESTAMP}', INTERVAL 1 HOUR) AND DATE_SUB(NOW(), INTERVAL 31 DAY); SELECT ROW_COUNT();")

This is my current cleanup script:

DELETE FROM event WHERE timestamp BETWEEN DATE_SUB('${OLDEST_TIMESTAMP}', INTERVAL 1 HOUR) AND DATE_SUB(NOW(), INTERVAL 31 DAY);

DELETE FROM data USING data LEFT OUTER JOIN event USING (sid,cid) WHERE event.sid IS NULL;
DELETE FROM iphdr USING iphdr LEFT OUTER JOIN event USING (sid,cid) WHERE event.sid IS NULL;
DELETE FROM icmphdr USING icmphdr LEFT OUTER JOIN event USING (sid,cid) WHERE event.sid IS NULL;
DELETE FROM tcphdr USING tcphdr LEFT OUTER JOIN event USING (sid,cid) WHERE event.sid IS NULL;
DELETE FROM udphdr USING udphdr LEFT OUTER JOIN event USING (sid,cid) WHERE event.sid IS NULL;
DELETE FROM opt USING opt LEFT OUTER JOIN event USING (sid,cid) WHERE event.sid IS NULL;

I switch back and forth between the two because I don't know which is faster but the reality is that both are too slow.

Goulder answered 11/1, 2016 at 17:3 Comment(7)
Are there foreign keys from the "other" tables to your main table?Equivalency
InnoDB? MyISAM? table definitions?Mister
Yes, there are foreign keys but the other tables would have the same number of rows. They are MyISAM tables.Goulder
In proportion, how many % of the event table are you deleting every day?Amandine
My other question is: why do you need to delete these records?Amandine
I would reckon 1/30th of the table is being deleted every day. I delete them because I need space for more records.Goulder
I have 2 servers with the same issue. I continued to use triggers in one server and it's manageable enough. On the other server, I updated to InnoDB and used ON DELETE CASCADEs on foreign keys. Both do reasonably well only because for triggers, the server buffers all incoming data even when the table is locked. Then when the table is unlocked, the table is written to like mad. This means we still get all the necessary data.Goulder
M
0

Try settting your foreign keys to cascade on delete, so you won't need to create the trigger and manually join and delete the related records.

Below example shows how to create a relationship that cascades the delete

CREATE TABLE parent ( id INT NOT NULL, PRIMARY KEY (id) ) ENGINE=INNODB; CREATE TABLE child ( id INT, parent_id INT, INDEX par_ind (parent_id), FOREIGN KEY (parent_id) REFERENCES parent(id) ON DELETE CASCADE ) ENGINE=INNODB;

Example from Mysql website

Mac answered 11/1, 2016 at 21:37 Comment(5)
I'm using a MyISAM table but if I were to convert to a InnoDB table, would cascade deletions be faster than trigger deletions?Goulder
Unsure, I'd probably try it to rule it out then try the partition approachMac
Let us know what's works fastest, interesting questionMac
As it turns out, I don't think I can use cascade deletions: #1457513. I have more than 1 related table.Goulder
That post is Microsoft sql, not Mysql. I think you can with mysqlMac
A
0

We solved problem like this with creating and dropping partitions. So 1st you create partitions by date in your table (best practice - automation with MySql events), and when you need to delete old data - simply drop some partitions - operation will be instantly, without any delay or lock.

Annadiane answered 11/1, 2016 at 21:43 Comment(1)
That's the route I'm most interested in but I don't know how to manage the 6 related tables. There is no field to partition by in the other tables.Goulder
S
0

How about saving the ids of the rows you are going to delete into a temporary table before deleting them.

Then you can switch your cleanup script from joining on a large table where id = null to joining on a small(er) table where id <> null.

Scholastic answered 11/1, 2016 at 21:58 Comment(2)
And with MySQL 5.6+ you can specify the partition to join to, avoiding the need to copy data into a temporary table. ... inner join the_big_table partition(partition_name) on ...Fair
Would this be significantly faster? I can definitely try this.Goulder
E
0

I would do two things:

Define the foreign keys in the other tables with

ON DELETE CASCADE

and rather than nibble away at the rows hour by hour, add a LIMIT to simple delete

DELETE FROM event
WHERE timestamp < DATE_SUB(NOW(), INTERVAL 31 DAY)
LIMIT 500000

And keep rerunning it until there are no rows affected or as many times as experience tells you is needed.

Tune the 500000 to be as big as you can make it without the query dying.

Equivalency answered 12/1, 2016 at 17:3 Comment(3)
From what I've seen, ON DELETE CASCADE is only on InnoDB tables. I'm using MyISAM tables. I've heard of that method of deleting in chunks. I don't understand the performance enhancement from it though. How is it faster/more efficient?Goulder
Is event.cid incremental (always increasing in value)?Equivalency
I don't think so. Not necessarily.Goulder
E
0

Change your script to:

  • ensure there is an index on cid for all tables
  • capture the cid values you're about to delete from event
  • rather than targeting all old rows. target (up to) a (small) maximum amount of old rows, so it executes relatively quickly
  • run the script frequently (say every 5 minutes, every hour, every day, whatever makes sense)

Something like:

CREATE TABLE IF NOT EXISTS deleted_cids(int cid); -- ensure same datatype as cid in tables
TRUNCATE deleted_cids;
INSERT INTO deleted_cids
SELECT cid FROM event
WHERE timestamp BETWEEN DATE_SUB('${OLDEST_TIMESTAMP}', INTERVAL 1 HOUR)
  AND DATE_SUB(NOW(), INTERVAL 31 DAY)
LIMIT 100000; -- Choose largest LIMIT that gives acceptable execution time
DELETE event FROM deleted_cids, event WHERE event.cid = deleted_cids.cid;
DELETE data FROM deleted_cids, data WHERE data.cid = deleted_cids.cid;
DELETE iphdr FROM deleted_cids, iphdr WHERE iphdr.cid = deleted_cids.cid;
DELETE icmphdr FROM deleted_cids, icmphdr WHERE icmphdr.cid = deleted_cids.cid;
DELETE tcphdr FROM deleted_cids, tcphdr WHERE tcphdr.cid = deleted_cids.cid;
DELETE udphdr FROM deleted_cids, udphdr WHERE udphdr.cid = deleted_cids.cid;
DELETE opt FROM deleted_cids, opt WHERE opt.cid = deleted_cids.cid;

The advantage here is that each deletion is an indexed-based, single execution to delete all targeted rows - it should execute quickly.

By tuning the LIMIT and frequency of execution, you can find the right balance of server load. I would opt for frequent executions of smaller amounts, so your server is never brought to a grinding halt by the process.

Equivalency answered 16/1, 2016 at 17:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.