Django's "dumpdata" or Postgres' "pg_dump"?

I'm unsure whether this question should be posted in the Database Administrators' section or here, so please advise if I got it wrong.

I have a Django-based website which doesn't change much. I use python manage.py dumpdata --all --indent=2 > backup.json and reload the data with loaddata if I need to redeploy or the db gets corrupted. (I'm aware of the integrity errors that occur when auth and content_types are not excluded.)
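
For reference, the exclude-based dump that's usually suggested to avoid those errors looks roughly like this (a sketch, not what I currently run; the flags are standard dumpdata options):

    python manage.py dumpdata --natural-foreign --natural-primary \
        --exclude contenttypes --exclude auth.permission \
        --indent 2 > backup.json

    # and reloading later with
    python manage.py loaddata backup.json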

Since I'm using PostgreSQL on the backend, is it "best practice" or "wiser" for me to use pg_dump instead, and then pg_restore if something goes wrong or if I need to redeploy?

So dumpdata dumps all data associated with the selected apps (and/or models), while pg_dump performs a full dump of the db. Is this the same thing, or is there a fundamental difference that I've missed (mind you, I have zero DBA experience)?

Which option do I go for and why?

Anatol answered 29/6, 2016 at 15:4

It is both best practice and wiser for you to use pg_dump instead of dumpdata.

There are many reasons for this.

  • pg_dump is faster and its output is more compact (particularly with the -Fc custom format) than dumpdata's.

  • Importing the data back into the database with pg_restore will also be faster than Django's loaddata.

  • pg_restore is available on any PostgreSQL installation, whereas Django and its dependencies have to be installed first.

  • Last but not least, the integrity errors that you spoke of will not happen with pg_dump/pg_restore.

Generally pg_dump is used to dump the entire database; however, the -t option allows you to dump one or a few tables at a time (see the example below).
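
For example (database name, user, and table names below are placeholders):

    # compressed custom-format dump, restorable with pg_restore
    pg_dump -Fc -U myuser mydb > backup.dump

    # restore into an (ideally empty) database
    pg_restore -U myuser -d mydb backup.dump

    # dump only selected tables with -t
    pg_dump -Fc -t myapp_article -t auth_user -U myuser mydb > tables.dump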

Fadden answered 30/6, 2016 at 10:46
Noted. I've already written a script that backs up with pg_dump hourly. Since my DB is really tiny (~30MB), I hadn't noticed a major performance difference. Thank you for your answer. – Anatol
You are welcome. If you are dumping regularly, you might want to consider using WAL archiving: postgresql.org/docs/9.2/static/continuous-archiving.html – Fadden
Thanks for the link. I did not know that PostgreSQL had a continuous backup system. At the moment, I use pg_dump and then commit and push the changes to a private GitHub repo, ensuring incremental changes. I also have the ability to restore data at any point in time, with a worst-case scenario of an hour's loss of work. Should I still consider using WAL? Is it worth the effort (time-wise) to implement WAL? It looks a bit complex in comparison to pg_dump and git. Any advice is appreciated. – Anatol
Yes, it's a bit complex to set up initially in comparison to pg_dump + git, but the advantage is that it does not have any impact at all on performance (dumpdata > pg_dump > WAL). The other point is that you are adding about 600MB each day to your git repo, so your bills would keep adding up. – Fadden
True, scalability needs to be taken into account, so performance matters. Thanks for the advice. (P.S. The Git commits only store the deltas, so it's more on the order of ~1MB per day. But valid point, I'll take it into account.) – Anatol
Only the deltas for text, isn't it? I think for binaries it's the whole file. – Fadden
Thanks for pointing that out. So after a day of running the script hourly, my hosting company reported that I've used 140MB of upload, including site traffic. I'm a bit unsure about this now. Is it perhaps because, when I inspect the pg_dump file, everything is in plain text? (It's a bunch of SQL statements and values, all human-readable.) – Anatol
Yes, but the diff is still going to be rather large because for every row that is updated, it will store the old record prefixed with a - and the new record prefixed with a + in git. – Fadden
Ah, I see! Thank you for that, and for your patience. I'll implement WAL archiving and give it a go. Thanks for the advice. – Anatol
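
For anyone wondering what the WAL archiving mentioned above involves, a minimal postgresql.conf sketch (paths are placeholders; see the linked continuous-archiving docs for the details):

    # postgresql.conf
    wal_level = archive                              # 'replica' on PostgreSQL 9.6+
    archive_mode = on
    archive_command = 'cp %p /mnt/backups/wal/%f'    # %p = WAL file path, %f = file name

    # plus an initial base backup to restore from, e.g.
    pg_basebackup -D /mnt/backups/base -U myuser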
