Making sense of Postgres row sizes

W

2

28

I got a large (>100M rows) Postgres table with structure {integer, integer, integer, timestamp without time zone}. I expected the size of a row to be 3*integer + 1*timestamp = 3*4 + 1*8 = 20 bytes.

In reality the row size is pg_relation_size(tbl) / count(*) = 52 bytes. Why?

(No deletes are done against the table: pg_relation_size(tbl, 'fsm') ~= 0)

Wincer answered 26/11, 2012 at 18:8 Comment(0)

C

63

Calculation of row size is more complex than that.

Storage is typically partitioned in data pages of 8 kB. There is a small fixed overhead per page, possible remainders not big enough to fit another tuple, and more importantly dead rows or a percentage initially reserved with the FILLFACTOR setting.

And there is more overhead per row (tuple): an item identifier of 4 bytes at the start of the page, the HeapTupleHeader of 23 bytes and alignment padding. The start of the tuple header as well as the start of tuple data are aligned at a multiple of MAXALIGN, which is 8 bytes on a typical 64-bit machine. Some data types require alignment to the next multiple of 2, 4 or 8 bytes.

Quoting the manual on the system table pg_type:

typalign is the alignment required when storing a value of this type. It applies to storage on disk as well as most representations of the value inside PostgreSQL. When multiple values are stored consecutively, such as in the representation of a complete row on disk, padding is inserted before a datum of this type so that it begins on the specified boundary. The alignment reference is the beginning of the first datum in the sequence.

Possible values are:

c = char alignment, i.e., no alignment needed.

s = short alignment (2 bytes on most machines).

i = int alignment (4 bytes on most machines).

d = double alignment (8 bytes on many machines, but by no means all).

Read about the basics in the manual.

Your example

This results in 4 bytes of padding after your 3 integer columns, because the timestamp column requires double alignment and needs to start at the next multiple of 8 bytes.

So, one row occupies:

   23   -- heaptupleheader
 +  1   -- padding or NULL bitmap
 + 12   -- 3 * integer (no alignment padding here)
 +  4   -- padding after 3rd integer
 +  8   -- timestamp
 +  0   -- no padding since tuple ends at multiple of MAXALIGN

Plus item identifier per tuple in the page header (as pointed out by @A.H. in the comment):

 +  4   -- item identifier in page header
------
 = 52 bytes

So we arrive at the observed 52 bytes.

The calculation pg_relation_size(tbl) / count(*) is a pessimistic estimation. pg_relation_size(tbl) includes bloat (dead rows) and space reserved by fillfactor, as well as overhead per data page and per table. (And we didn't even mention compression for long varlena data in TOAST tables, since it doesn't apply here.)

You can install the additional module pgstattuple and call SELECT * FROM pgstattuple('tbl_name'); for more information on table and tuple size.

So then Postgres is not so good for huge tables with very short rows, eg. couple of ints. The 28 bytes overhead will always bloat it. Do you know if Postgres compresses these tables when holding them in cache? – Wincer 26/11, 2012 at 18:33

Isn't there also an addition per-row overhead in each block: The ItemData pointer (4 byte) to the actual tuple header? – Dahlia 26/11, 2012 at 18:33

@A.H.: Good point. Not part of the tuple itself, but the ItemData pointer is allocated in the page header per tuple and should explain the difference between 48 bytes in my calculation and the observed 52 bytes of disk space. I added a note to my answer. – Cornuted 26/11, 2012 at 18:38

@Arman: Representation of data in RAM needs even a bit more space. So no, no compression there. If you have long character strings they are compressed and possibly "toasted". More about TOAST in the manual here. So, there is a considerable overhead for very small tuples. Still, operations on tables are usually very fast, so don't fall for the temptation to prematurely denormalize your tables. If in doubt, run performance tests. – Cornuted 26/11, 2012 at 18:49

@ErwinBrandstetter: Hm, how do we make sense of this then: A 400M RAM Postgres server, with a 4G database (1 table), serving 30K cpm load with ease. About 50% reads/50% inserts into that 1 table. How can it possibly handle this much load with disk reads/writes? – Wincer 26/11, 2012 at 18:58

The table doesn't need to be cached as a whole. Also, there is a ton of other factors here .. – Cornuted 26/11, 2012 at 19:9

B

6

Each row has metadata associated with it. The correct formula is (assuming naïve alignment):

3 * 4 + 1 * 8 == your data
24 bytes == row overhead
total size per row: 23 + 20

Or roughly 53 bytes. I actually wrote postgresql-varint specifically to help with this problem with this exact use case. You may want to look at a similar post for additional details re: tuple overhead.

Butter answered 26/11, 2012 at 18:15 Comment(0)

Your example

Recommended topics

Hot tags