Why is my PostgreSQL table larger (in GB) than the csv it came from?
A <4 GB CSV became a 7.7 GB table in my AWS Postgres instance, and a 14 GB CSV wouldn't load into 22 GB of space, presumably because it too was going to roughly double in size. Is this factor of two normal? If so, why, and is it reliable?

Arrhenius asked 23/3, 2015 at 22:46
There are many possible reasons:

  • Indexes take up space. If you have lots of indexes, especially multi-column indexes or GiST / GIN indexes, they can be a big space hog.

  • Some data types are represented more compactly in text form than in a table. For example, 1 consumes 1 byte in a CSV (2 if you count the comma delimiter), but storing it in a bigint column takes 8 bytes.

  • If a FILLFACTOR is set, PostgreSQL intentionally leaves free space in each page to make later UPDATEs and INSERTs faster. If you don't know what FILLFACTOR is, you haven't set one.

  • PostgreSQL has a much larger per-row overhead than CSV. In CSV, the per-row overhead is one or two bytes for the line ending. Rows in a PostgreSQL table require 24 to 28 bytes plus the data values, mainly because of the metadata required for multiversion concurrency control. So a CSV with very many narrow rows will produce a significantly bigger table than a CSV of the same size in bytes that has fewer, wider rows.

  • PostgreSQL can do out-of-line storage and compression of values using TOAST. This can make big text strings significantly smaller in the database than in CSV.
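You can see the data-type and per-value size points concretely with `octet_length` (bytes as text, i.e. roughly the CSV footprint) and `pg_column_size` (bytes the value occupies in a row); the ~24-byte tuple header comes on top of these per row:

```sql
-- octet_length: bytes the value occupies as text (as in the CSV)
-- pg_column_size: bytes the value occupies inside a table row
SELECT octet_length('1')          AS csv_bytes,     -- 1
       pg_column_size(1::integer) AS int_bytes,     -- 4
       pg_column_size(1::bigint)  AS bigint_bytes;  -- 8
```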

You can use octet_length and pg_column_size to get PostgreSQL to tell you how big values and rows are. Because of TOAST out-of-line compressed storage, pg_column_size can report a different size for a value produced by a VALUES expression than for the same value SELECTed from a table.
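For example (using a hypothetical scratch table `t`), the same long string can report different sizes depending on whether it has been stored, because storage may TOAST-compress it:

```sql
-- In-expression value: full, uncompressed size (plus a small header)
SELECT pg_column_size(repeat('x', 100000));

-- The same value stored in a table is typically TOAST-compressed,
-- so pg_column_size reports a much smaller number
CREATE TABLE t (long_col text);
INSERT INTO t VALUES (repeat('x', 100000));
SELECT pg_column_size(long_col) FROM t;
```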

You can also use pg_total_relation_size to find out how big the table, including its indexes and TOAST storage, is for a given sample of your input.
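A practical way to check before committing to the full load: import a representative slice of the CSV into a scratch table and measure it. The table, columns, and file name below are placeholders; `\copy` is a psql meta-command that reads the file from the client side:

```sql
CREATE TABLE csv_sample (a integer, b integer, c text);
\copy csv_sample FROM 'sample.csv' WITH (FORMAT csv)

-- Total on-disk size: heap + indexes + TOAST
SELECT pg_size_pretty(pg_total_relation_size('csv_sample'));

-- Heap only, and index overhead separately
SELECT pg_size_pretty(pg_relation_size('csv_sample')),
       pg_size_pretty(pg_indexes_size('csv_sample'));
```

Dividing the measured size by the sample's share of the CSV gives a rough projection for the full table.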

Sherrer answered 24/3, 2015 at 1:50
Thanks for your thorough answer! I have not yet added indexes, nor did I use FILLFACTOR. I'm guessing in my case it's due to the overhead you mentioned and the column types: I have three labeled simply "integer" and one "text". Perhaps I should look at reclassifying them, and also look into TOAST. – Arrhenius
