PostgreSQL: Automating VACUUM FULL for bloated tables

We have a product using a PostgreSQL database server that is deployed at a couple of hundred client sites. Some of them have gathered tens of gigabytes of data over the years. So in the next version we will introduce automated cleanup procedures that will gradually archive and DELETE old records during nightly batch jobs.

If I understand correctly, autovacuum will kick in, analyze the tables and mark the dead tuples as reusable, so performance will be roughly what it was when fewer records existed.

The actual disk space will not be released, if I understand correctly, since that only happens with a VACUUM FULL, and that is not triggered by autovacuum.

So I was thinking about an automated process that would do this.

I found the bloat view that is used by the Nagios check_postgres script at http://wiki.postgresql.org/wiki/Show_database_bloat.

Is this view any good? Am I correct in understanding that if the tbloat is > 2, it could use a VACUUM FULL? And if the ibloat is too high, it could use a REINDEX?
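
For reference, a query along these lines would list the candidates in one go (bloatview here is whatever name you give the wiki query when you wrap it in a view, and the > 2 thresholds are just the ones mentioned above):

SELECT tablename, tbloat, iname, ibloat
FROM bloatview
WHERE tbloat > 2 OR ibloat > 2
ORDER BY tbloat DESC;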

Any comments on the following job to run as daily batch job?

  • vacuumdb -Z mydatabase #vacuum with analyze-only
  • select tablename from bloatview order by tbloat desc limit 1
  • vacuumdb -f -t tablename mydatabase
  • select tablename, iname from bloatview order by ibloat desc limit 1
  • reindexdb -t tablename -i iname mydatabase

Of course, I still need to wrap it in a nice Perl script in crontab (we're using Ubuntu 12), or does PostgreSQL have some sort of scheduler I could use for this?

Or is this total overkill and is there a much simpler procedure?

Fardel answered 18/12, 2012 at 11:30 Comment(1)
vacuumdb -Z is probably not necessary; autovacuum seems to do a pretty good job of keeping the statistics up to date. – Fardel
6

You probably don't need it. It is good to do this once, after the first archiving job, so you get your disk space back, but after that your daily archiving job and autovacuum will prevent dead-tuple bloat.

Also, instead of VACUUM FULL it is often better to run CLUSTER table_name USING index_name followed by ANALYZE table_name. This reorders the rows according to an index, so related rows end up physically close together on disk, which can reduce disk seeking (important on classic disk drives, largely irrelevant on SSDs) and the number of reads for your typical queries.
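
A minimal sketch of that, assuming a table named mytable clustered on an index mytable_created_at_idx (both names are placeholders, not from the question):

-- rewrite the table in index order, then refresh the planner statistics
CLUSTER mytable USING mytable_created_at_idx;
ANALYZE mytable;

Note that CLUSTER is a one-off reordering: PostgreSQL does not keep the table in index order afterwards, so it has to be re-run periodically if the ordering matters to you.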

And remember that both VACUUM FULL and CLUSTER take an exclusive lock, so they will make your tables unusable while they run.

Harvison answered 19/12, 2012 at 18:42 Comment(3)
Thanks for the hint. The cleanup process we introduced will be scheduled to run for a couple of hours at night, but at some clients it will take several nights before it's finished. I agree it's better to wait until the cleanup is completely finished, and then do the VACUUM FULL or CLUSTER only once during a scheduled downtime. But we have around 200 installations we need to schedule this for; that's why we want some automation for it. Thanks for the hint on CLUSTER. – Fardel
From PostgreSQL 9.0 onward, optimizations to VACUUM FULL made it the proper way to do that (see: wiki.postgresql.org/wiki/VACUUM_FULL#CLUSTER). – Backdate
One would think buffer would always be faster than flash. So there is no sequential prefetch in Postgres? – Featherhead

OK, I worked my way through it.

I simplified/reworked the view and split it into the following two:

CREATE OR REPLACE VIEW
    bloat_datawidth AS
SELECT
    ns.nspname AS schemaname,
    tbl.oid   AS relid,
    tbl.relname,
    CASE
        WHEN every(avg_width IS NOT NULL)
        THEN SUM((1-null_frac)*avg_width) + MAX(null_frac) * 24
        ELSE NULL
    END AS datawidth
FROM
    pg_attribute att
JOIN
    pg_class tbl
ON
    att.attrelid = tbl.oid
JOIN
    pg_namespace ns
ON
    ns.oid = tbl.relnamespace
LEFT JOIN
    pg_stats s
ON
    s.schemaname=ns.nspname
AND s.tablename = tbl.relname
AND s.inherited=false
AND s.attname=att.attname
WHERE
    att.attnum > 0
AND tbl.relkind='r'
GROUP BY
    1,2,3;

And the view that derives the actual bloat figures from it:

CREATE OR REPLACE VIEW
    bloat_tables AS
SELECT
    bdw.schemaname,
    bdw.relname,
    bdw.datawidth,
    cc.reltuples::bigint,
    cc.relpages::bigint,
    ceil(cc.reltuples*bdw.datawidth/current_setting('block_size')::NUMERIC)::bigint AS expectedpages,
    100 - (cc.reltuples*100*bdw.datawidth)/(current_setting('block_size')::NUMERIC*cc.relpages) AS bloatpct
FROM
    bloat_datawidth bdw
JOIN
    pg_class cc
ON
    cc.oid = bdw.relid
AND cc.relpages > 1
AND bdw.datawidth IS NOT NULL;
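
For a quick manual check, the second view can be queried directly. A sketch (the 65% threshold mirrors the MIN_BLOAT value used in the cron job below, and is only an example):

SELECT schemaname, relname, bloatpct, relpages - expectedpages AS wastedpages
FROM bloat_tables
WHERE bloatpct > 65
ORDER BY relpages - expectedpages DESC;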

And the cron job:

#!/bin/bash

MIN_BLOAT=65
MIN_WASTED_PAGES=100
LOG_FILE=/var/log/postgresql/bloat.log
DATABASE=unity-stationmaster
SCHEMA=public

if [[ "$(id -un)" != "postgres" ]]
then
echo "You need to be user postgres to run this script."
exit 1
fi

TABLENAME=`psql $DATABASE -t -A -c "select relname from bloat_tables where bloatpct > $MIN_BLOAT and relpages-expectedpages > $MIN_WASTED_PAGES and schemaname = '$SCHEMA' order by relpages-expectedpages desc limit 1"`

if [[ -z "$TABLENAME" ]]
then
echo "No bloated tables." >> $LOG_FILE
exit 0
fi

vacuumdb -v -f -t $TABLENAME $DATABASE >> $LOG_FILE
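
To run it nightly, a plain crontab entry for the postgres user is enough. A sketch, assuming the script is saved as /usr/local/bin/vacuum_bloated.sh (the path and time of day are placeholders):

# m h dom mon dow  command
30 3 * * * /usr/local/bin/vacuum_bloated.sh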
Fardel answered 19/12, 2012 at 14:27 Comment(0)
