pg_dump and pg_restore on giant databases

I currently have the task of improving a database structure. For this, we want to efficiently dump and restore one single giant database (approx. 1 TB and growing).

To test things with this database, we wanted to transfer it to another server node via pg_dump and pg_restore.

We are running a v10 server (https://www.postgresql.org/docs/10/app-pgdump.html), so we are limited to its parameters. It is also required to dump the full database, not only parts of it.

For this I tried a couple of approaches, and a few external sources helped a lot.

The problem is that you can almost always improve only one of these tasks, not both simultaneously.

Case 1

Dumping in directory format is extremely fast (~1 hour), but restoring is not.

pg_dump --blobs --dbname="$DBNAME" --file=$DUMPDIR --format=directory --host=$SERVERHOSTNAME --jobs=$THREADS --port=$SERVERPORT --username="$SERVERUSERNAME"
pg_restore --clean --create --format=directory --jobs=$THREADS --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" "./"

The problem with this restore method is that even though I assigned multiple jobs to it, it only uses one core, with barely 4% CPU usage on that server core.
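As it later turned out (see the comments under the answer below), the missing --dbname was the reason only one worker ran. For reference, a sketch of the same restore with the target database stated explicitly; connecting to postgres first is an assumption, since with --create any existing maintenance database works as the initial connection:

pg_restore --clean --create --dbname=postgres --format=directory --jobs=$THREADS --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" "$DUMPDIR"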

Case 2

Dumping in custom format is so slow that the server couldn't even complete it overnight (session timeout).

pg_dump --blobs --compress=9 --dbname="$DBNAME" --file="$DUMPDIR/db.dump" --format=custom --host=$SERVERHOSTNAME --port=$SERVERPORT --username=$SERVERUSERNAME
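If the custom format has to stay, the compression level is the obvious knob: --compress=9 is the slowest setting, and pg_dump accepts levels 0 through 9. A sketch of the same dump at a lower level (the value 1 is illustrative, not benchmarked):

pg_dump --blobs --compress=1 --dbname="$DBNAME" --file="$DUMPDIR/db.dump" --format=custom --host=$SERVERHOSTNAME --port=$SERVERPORT --username=$SERVERUSERNAME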

So I had different approaches in mind:

  1. Dump it with approach #1, convert it afterwards (how?) and use a faster restore method (variant #2?)
  2. Create multiple dumps simultaneously on different cores, one per schema (we have a total of 6), and then merge them back (how? See the sketch below.)
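For idea #2, pg_dump's --schema switch restricts a dump to a single schema, so the six dumps can run as concurrent background jobs. A minimal sketch, assuming placeholder schema names:

# one directory-format dump per schema, all running concurrently
for schema in schema1 schema2 schema3 schema4 schema5 schema6; do
    pg_dump --blobs --format=directory --schema="$schema" --file="$DUMPDIR/$schema" --dbname="$DBNAME" --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" &
done
wait  # block until all six background dumps have finished

Merging back would then just mean running pg_restore once per directory against the same target database, although cross-schema dependencies may force a particular restore order.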

Piping seems to be an ineffective way of dumping, according to one of the authors referenced above.

Does anyone have more experience with this? Are my ideas useful, or do you have a completely different solution in mind?

Oh, before I forget: we are currently limited to 5 TB on our external server, and the internal server that runs the DB should not get bloated with data fragments, even temporarily.

Pinup answered 14/3, 2019 at 13:48 Comment(0)

A parallel pg_restore with the directory format should speed up processing.

If it doesn't, I suspect that much of the data is in one large table, which pg_restore (and pg_dump) cannot parallelize.

Make sure you disable compression (-Z 0) to improve speed (unless you have a weak network).
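For example, the directory-format dump from the question with compression disabled; only -Z 0 is added, everything else is taken from the question:

pg_dump --blobs --format=directory -Z 0 --jobs=$THREADS --dbname="$DBNAME" --file=$DUMPDIR --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME"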

You might be considerably faster with an online file system backup:

  • pg_basebackup is simple, but cannot be parallelized.

  • Using the low-level API, you can parallelize the backup with operating system or storage techniques.

The disadvantage is that with a file system backup, you can only copy the whole database cluster.
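A sketch of both variants on v10; the data directory path and the target host are placeholders, and the low-level variant uses the exclusive backup API (still available in version 10):

# simple, but single-threaded:
pg_basebackup --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" --pgdata=/path/to/target --wal-method=stream

# low-level API; the copy step can be parallelized with OS or storage tools:
psql -h $SERVERHOSTNAME -p $SERVERPORT -U "$SERVERUSERNAME" -d postgres -c "SELECT pg_start_backup('migration', true)"
rsync -a /var/lib/postgresql/10/main/ target:/var/lib/postgresql/10/main/  # or several rsyncs over disjoint subdirectories
psql -h $SERVERHOSTNAME -p $SERVERPORT -U "$SERVERUSERNAME" -d postgres -c "SELECT pg_stop_backup()"
# note: the WAL generated between start and stop must also be retained (e.g. via archiving) for the copy to be consistent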

Mucosa answered 14/3, 2019 at 14:30 Comment(5)
Thanks for the reply! Yes, we have one large table, which makes parallelizing impossible (see case #1). I hoped that the compression would increase restore performance; sadly it doesn't. I was thinking about using a simple rsync to get at least an initial reference backup, so we can run tests on the other node and later migrate to another pg_dump solution. Will have a look at both methods you suggested.Pinup
Found out what was causing the multithreading issue on restore: the dbname wasn't explicitly stated, and the toc.dat seemed to use a wrong dbname, so it tried to restore everything into the output file instead of the database. Manually setting a dbname in the restore fixed the issue; it now works like a charm. Still, there is plenty of room for improvement. Thanks a lot!Pinup
So I have a similar situation where I'll be moving a much bigger database (5 TB) over AWS infrastructure. The dump will be done to S3, and the restore into an RDS instance in another account. How long did the 1 TB restore take you, and how many threads did you spawn for the restore?Gow
Note that on AWS you can share RDS snapshots to other accounts and spin them up.Demonography
@AtroCty That must have been an old version of pg_restore. These days, it requires you to specify either -f or -d.Mucosa
