pg_dump with --exclude-table still includes those tables in the background COPY commands it runs?

I am trying to take a backup of a TimescaleDB database, excluding two very big hypertables. While the backup is running, I would therefore not expect to see any COPY commands for the underlying chunks of those hypertables, but I actually do!

Let's say TestDB is my database and it has two big hypertables in schema mySchema called hyper1 and hyper2, as well as other normal tables.

I run the following command:

pg_dump -U user -F t TestDB --exclude-table "mySchema.hyper1" --exclude-table "mySchema.hyper2"  > TestDB_Backup.tar

Then I check the running queries (especially because I did not expect the backup to take this long) and I find that several COPY commands are running, one for each chunk of the tables I actually excluded.
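For reference, the running statements can be inspected with something along these lines (just a sketch; the filter on the query text is illustrative):

SELECT pid, state, query
FROM pg_stat_activity
WHERE query ILIKE 'COPY %'
ORDER BY pid;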

This is TimescaleDB version 1.7.4.

Has this ever happened to any of you, and what is actually going on here?

P.S. I am sorry I cannot really provide a repro for this and that this is more of a discussion than an actual programmatic problem, but I still hope someone has seen this before and can show me what I am missing :)

Oath answered 24/11, 2020 at 20:58 Comment(13)
Even though it says here this can be done, I'm guessing that is not the case. Further guessing it is excluding the parent table, but not the child (chunk) tables. – Propagation
Yes, I also read the documentation many times and yet here I am. But I think you might be on to something with the idea of it possibly excluding only the parent. Will look into it more closely, thanks for your comment! – Oath
Also, what version of pg_dump are you using? Is it the one that came with TimescaleDB? – Propagation
I am using pg_dump (PostgreSQL) 11.9 on a Linux machine. – Oath
You can use patterns with --exclude-table/-T to match multiple tables. See here: Patterns. – Propagation
Forgot that TimescaleDB is an extension on an existing Postgres install, so you will be using the Postgres core pg_dump. – Propagation
Yep! Also, changing to a pattern might bring me to the same place (so basically that only makes specifying the table names easier for me, but I am afraid the background COPY will run just the same) - I could give it a try though! – Oath
The COPY will only run for those tables that pg_dump thinks need to be output. If a table name matches the pattern, it will not be output. – Propagation
So you mean you think specifying it one by one vs. specifying it via a pattern will make the difference? I will give it a try now! – Oath
It just starts with copying the hyper chunks. So weird. Thanks anyway! :) – Oath
No, I mean you can exclude multiple tables in one --exclude-table if they match the pattern. So assuming hyper1 has child tables named hyper1a, hyper1b, etc., --exclude-table hyper1* will match them all. – Propagation
Is the schema actually mySchema in your database? If so, then you will need to do --exclude-table '"mySchema.hyper2"' (single quotes enclosing double quotes) to preserve the case and have it match. – Propagation
Yes, that was fine. The issue was what you mentioned in the beginning: even though the hypertables are excluded, the chunks are still dumped. – Oath

pg_dump dumps each child table separately and independently of its parent, so when you exclude a hypertable its chunk tables are still dumped; that is why you see a COPY running for every chunk of the excluded hypertables.

Note that excluding hypertables and their chunks will not give you a dump that restores correctly into a TimescaleDB instance, because the TimescaleDB metadata would no longer match the actual state of the database. TimescaleDB maintains catalog tables with information about hypertables and chunks; to pg_dump these are just ordinary user tables, so it dumps them (which is important), but after a restore they would still describe all the hypertables and chunks that existed in the database before the dump.

So instead of excluding the hypertables or chunks themselves, you need to exclude their data, which still reduces dump and restore time; the excluded hypertables then need to be dropped after the restore. Table data is excluded with the pg_dump parameter --exclude-table-data. There is an issue in the TimescaleDB GitHub repo that discusses how to exclude hypertable data from a dump, and it suggests generating the exclude string like this:

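-- Builds a space-separated list of --exclude-table-data options, one per chunk of the
-- given hypertable (using the compressed chunk's table name where a chunk is compressed)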
SELECT string_agg(format($$--exclude-table-data='%s.%s'$$,coalesce(cc.schema_name,c.schema_name), coalesce(cc.table_name, c.table_name)), ' ')
FROM _timescaledb_catalog.hypertable h 
  INNER JOIN _timescaledb_catalog.chunk c on c.hypertable_id = h.id 
  LEFT JOIN _timescaledb_catalog.chunk cc on c.compressed_chunk_id = cc.id
WHERE h.schema_name = <foo> AND h.table_name = <bar> ;
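For example, one way to use it (a sketch; exclude.sql is a hypothetical file containing the query above) is to capture the generated string and feed it back to pg_dump:

# Generate the --exclude-table-data options for the chosen hypertable
EXCLUDES=$(psql -U user -d TestDB -At -f exclude.sql)
# eval so that the single quotes embedded in $EXCLUDES are honoured as shell quoting
eval pg_dump -U user -Fc -f TestDB_Backup.bak "$EXCLUDES" TestDB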

Alternatively, you can look up the hypertable_id and exclude data from all chunk tables whose names are prefixed with that id. Find the hypertable_id in the catalog table _timescaledb_catalog.hypertable:

SELECT id
FROM _timescaledb_catalog.hypertable
WHERE schema_name = 'mySchema' AND table_name = 'hyper1';

Let's say that the id is 2. Then dump the database as follows:

pg_dump -U user -Fc -f TestDB_Backup.bak \
  --exclude-table-data='_timescaledb_internal._hyper_2*' TestDB 
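For completeness, the restore side might then look roughly like this (a sketch only: the target database name TestDB_restored is an assumption, and timescaledb_pre_restore()/timescaledb_post_restore() are the usual TimescaleDB restore helpers):

createdb -U user TestDB_restored
psql -U user -d TestDB_restored -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"
psql -U user -d TestDB_restored -c "SELECT timescaledb_pre_restore();"
pg_restore -U user -d TestDB_restored TestDB_Backup.bak
psql -U user -d TestDB_restored -c "SELECT timescaledb_post_restore();"
# Drop the excluded hypertable afterwards; its metadata and empty chunks were restored
psql -U user -d TestDB_restored -c "DROP TABLE mySchema.hyper1;"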
Potted answered 26/11, 2020 at 8:7 Comment(1)
Yes, this is the issue. When excluding a hypertable, pg_dump still dumps the chunks! Thanks for writing this answer; I forgot to update that I found the issue. – Oath
