Move rows older than x days to an archive table or partition table in Postgres 11

I would like to speed up the queries on my big table that contains lots of old data.

I have a table named post with a date column created_at. The table has ~31 million rows, and ~30 million of them are older than 30 days.

Specifically, I want this:

  • move data older than 30 days into the post_archive table, or create a partitioned table.
  • when a row's created_at value becomes older than 30 days, that row should be moved to the post_archive table or to a partition.

Is there a detailed and concrete solution for PostgreSQL 11.15?

My ideas:

  • Solution 1: create a cron script in any language (e.g. JavaScript) and run it every day to copy data older than 30 days from the post table into post_archive, then delete those rows from the post table.
  • Solution 2: create a Postgres function that copies the data from the post table into the partition table, and create a cron job that calls the function every day.

Thanks

Flosser answered 3/6, 2022 at 9:43 Comment(3)
I wouldn't move anything; there's no need to. Just partition the table by date (month or so) and SELECT only the data that you need: WHERE created_at >= CURRENT_DATE - INTERVAL '30 DAYS';Pekingese
Thanks for the answer. Now I have these questions: when the value in column "created_at" in table "post" becomes older than 30 days, how does that row end up in the right partition automatically? Also, how do I move the existing old data (30 million rows) into the partitioned table?Flosser
When your table "post" is partitioned by month, a record stays in that partition forever. Don't move it; that's a waste of I/O resources. Because of the WHERE condition in your SELECT statements, the partitions that hold data that is too old will simply not be scanned anymore.Pekingese
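
For reference, a minimal sketch of the partitioning approach described in these comments, using PostgreSQL 11 declarative range partitioning (the column list, monthly granularity, and partition names are illustrative; the existing table's data would need to be copied in):

-- Recreate post as a table range-partitioned by created_at.
CREATE TABLE post_partitioned (
    id         bigint NOT NULL,
    created_at date   NOT NULL
    -- ... the remaining columns of post ...
) PARTITION BY RANGE (created_at);

-- One partition per month; new ones must be added ahead of time.
CREATE TABLE post_2022_05 PARTITION OF post_partitioned
    FOR VALUES FROM ('2022-05-01') TO ('2022-06-01');
CREATE TABLE post_2022_06 PARTITION OF post_partitioned
    FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');

-- Copy existing rows; each lands in the partition for its month.
INSERT INTO post_partitioned SELECT id, created_at FROM post;

-- Queries with this WHERE clause scan only the recent partitions.
SELECT * FROM post_partitioned
 WHERE created_at >= CURRENT_DATE - INTERVAL '30 DAYS';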

This splits your data into a post table and a post_archive table. It's a common approach, and I've done it (with SQL Server).

Before you do anything else, make sure you have an index on your created_at column on your post table. Important.

Next, you need a single, consistent expression meaning "thirty days ago". This is it:

 (CURRENT_DATE - INTERVAL '30 DAY')::DATE

Next, back everything up. You knew that.

Then, here's your process to set up your two tables. (A combined script follows these steps.)

  1. CREATE TABLE post_archive AS TABLE post; to populate your archive table.

  2. Do these two steps to repopulate your post table with the most recent thirty days. It will take forever to DELETE all those rows, so we'll truncate the table and repopulate it. That's also good because it's like starting from scratch with a much smaller table, which is what you want. This takes a modest amount of downtime.

    TRUNCATE TABLE post;
    INSERT INTO post SELECT * FROM post_archive
     WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
    
  3. DELETE FROM post_archive WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE; to remove the most recent thirty days from your archive table.

Now, you have the two tables.
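
Putting steps 1 through 3 together, a sketch of the whole setup script (assuming post has no foreign keys referencing it; note that CREATE TABLE ... AS TABLE copies rows but not indexes, so the archive gets its own created_at index here):

-- 1. Populate the archive table with a full copy of post.
CREATE TABLE post_archive AS TABLE post;

-- CREATE TABLE ... AS TABLE copies rows, not indexes, so index
-- created_at on the archive for the statements below and for queries.
CREATE INDEX ON post_archive (created_at);

-- 2. Empty post, then repopulate it with the most recent thirty days.
TRUNCATE TABLE post;
INSERT INTO post SELECT * FROM post_archive
 WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE;

-- 3. Remove those same recent rows from the archive table.
DELETE FROM post_archive
 WHERE created_at > (CURRENT_DATE - INTERVAL '30 DAY')::DATE;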

Your next step is the daily row-migration job. PostgreSQL lacks a built-in job scheduler like SQL Server Agent jobs or MySQL's EVENT, so your best bet is a cronjob.

It's probably wise to do the migration daily, if that fits your business rules. Why? Many-row DELETEs and INSERTs create big transactions, and big transactions can make your RDBMS server thrash. Smaller batches of rows are better.

The SQL you need is something like this:

INSERT INTO post_archive SELECT * FROM post 
 WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE; 
DELETE FROM post 
 WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;

You can package this up as a shell script. On UNIX-derived systems like Linux and FreeBSD, the shell script file might look like this:

#!/bin/sh
psql postgres://username:password@hostname:5432/database << SQLSTATEMENTS
INSERT INTO post_archive SELECT * FROM post 
 WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE; 
DELETE FROM post 
 WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
SQLSTATEMENTS

Then run the shell script from cron a few minutes after 3am each day.
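
For example, assuming the script above is saved as /usr/local/bin/migrate_posts.sh (an illustrative path) and made executable, a crontab entry to run it at 03:22 daily could look like this:

22 3 * * * /usr/local/bin/migrate_posts.sh >> /var/log/migrate_posts.log 2>&1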

Some notes:

  • 3am? Why? In many places the daylight-saving-time switchover messes up the hour between 02:00 and 03:00 twice a year. Choosing, say, 03:22 as the time to run the daily migration keeps you well away from that problem.

  • CURRENT_DATE evaluates to today's date (midnight of today), so the cutoff stays the same all day. If you run the script more than once in any calendar day, no harm is done.

  • If you miss a day, the next day's migration will catch up.

  • You could package up the SQL as a stored procedure and put it into your RDBMS, then invoke it from your shell script (see the sketch after these notes). But then your migration procedure lives in two different places, and in PostgreSQL you need the cronjob and shell script in any case.

  • Will your application go off the rails if it sees identical rows in both post and post_archive while the migration is in progress? If so, you'll need to wrap your SQL statements in a transaction so that other users of the database won't see the duplicate rows. Do this:

    #!/bin/sh
    psql postgres://username:password@hostname:5432/database << SQLSTATEMENTS
    START TRANSACTION;
    INSERT INTO post_archive SELECT * FROM post 
      WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE; 
    DELETE FROM post 
      WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
    COMMIT;
    SQLSTATEMENTS
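
If you do choose the stored-procedure route described in the notes above, a PostgreSQL 11 sketch might look like this (the procedure name is illustrative):

CREATE OR REPLACE PROCEDURE archive_old_posts()
LANGUAGE plpgsql
AS $$
BEGIN
    -- Copy rows older than thirty days into the archive, then remove
    -- them from post; both statements share the calling transaction.
    INSERT INTO post_archive SELECT * FROM post
     WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
    DELETE FROM post
     WHERE created_at <= (CURRENT_DATE - INTERVAL '30 DAY')::DATE;
END;
$$;

The shell script then shrinks to a single CALL archive_old_posts(); statement.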
    

Cronjobs are quite reliable on Linux and FreeBSD.

Upgrowth answered 3/6, 2022 at 18:35 Comment(7)
Really detailed and concrete answer. I will go with this solution.Flosser
It's customary to "accept" workable answers. That helps other people with similar questions discover them.Upgrowth
one more question: DELETE won't free up the space. I know that VACUUM can help, or (not 100% sure) restarting the db. Any advice on how to deal with this?Flosser
The large amount of unreclaimed tablespace is one reason for the TRUNCATE TABLE step when setting this up. May I suggest you ask further questions, either here or on dba.stackexchange.com, to get more information about this?Upgrowth
Sure, will ask on dba.stackexchange.com. Since the cron job executes the daily row migration, copying rows to the archive table and deleting those rows from the post table, I then have to somehow reclaim the space.Flosser
Great ... could you maybe post another comment here when you do? I'm interested to see how your project unfolds if you're willing to share that data.Upgrowth
@Flosser Please mark the answer as accepted since it has addressed your question well.Reyreyes
