Is there any difference between integer and bit(n) data types for a bitmask?
Asked Answered
S

3

21

I am working with a table in a PostgreSQL database that has several boolean columns that determine some state (e.g. published, visible, etc.). I want to make a single status column that will store all these values as well as possible new ones in a form of a bitmask. Is there any difference between integer and bit(n) in this case?

This is going to be a rather big table, because it stores objects that users create via a web-interface. So I think I will have to use (partial) indexes for this column.

Silicate answered 3/4, 2012 at 10:32 Comment(0)
C
41

If you only have a few variables I would consider keeping separate boolean columns.

  • Indexing is easy. In particular also indexes on expressions and partial indexes.
  • Conditions for queries are easy to write and read and meaningful.
  • A boolean column occupies 1 byte (no alignment padding). For only a few variables this occupies the least space.
  • Unlike other options boolean columns allow NULL values for individual bits if you should need that. You can always define columns NOT NULL if you don't.

If you have more than a hand full variables but no more than 32, an integer column may serve best. (Or a bigint for up to 64 variables.)

  • Occupies 4 bytes on disk (may require alignment padding, depending on preceding columns).
  • Very fast indexing for exact matches ( = operator).
  • Handling individual values may be slower / less convenient than with varbit or boolean.

With even more variables, or if you want to manipulate the values a lot, or if you don't have huge tables or disk space / RAM is not an issue, or if you are not sure what to pick, I would consider bit(n) or bit varying(n) (short: varbit(n).

For just 3 bits of information, individual boolean columns get by with 3 bytes, an integer needs 4 bytes (maybe additional alignment padding) and a bit string 6 bytes (5 + 1).

For 32 bits of information, an integer still needs 4 bytes (+ padding), a bit string occupies 9 bytes for the same (5 + 4) and boolean columns occupy 32 bytes.

To optimize disk space further you need to understand the storage mechanisms of PostgreSQL, especially data alignment. More in this related answer.

This answer on how to transform the types boolean, bit(n) and integer may be of help, too.

Cholula answered 3/4, 2012 at 12:45 Comment(3)
Thanks a lot for the explanation, that is exactly what I needed! I think I will go with an integer column.Sternum
How can you use an integer serving several independent colums like bit?Pothole
@clarkk: In binary representation, an integer is a sequence of the allowed digits 0 and 1. (That's also how anything is stored in digital systems.) Each of these digits can be (ab-)used to store one independent, binary information. Basic mathematical operators can be used to read / write individual positions. (Postgres writes a new row version for any written change anyway, so barely any additional overhead.) Follow the given link for some basic techniques.Cholula
S
6

With the advent of GENERATED columns in PostgreSQL (as from version 12), you could do something like this (all of the code below is available on the fiddle here):

Base table:

CREATE TABLE test
(
  t_id INTEGER GENERATED BY DEFAULT AS IDENTITY,
  data TEXT,
  bitmask VARBIT(9)
);

but, with GENERATED columns, you can now do:

CREATE TABLE test
(
  t_id INTEGER GENERATED BY DEFAULT AS IDENTITY,
  data TEXT,
  bitmask VARBIT(9), -- choose 9 because it's not 8, to show that you don't have to 
                     -- select an INT or even a SMALLINT
  
  published   BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 0)::BOOLEAN) STORED,
  visible     BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 1)::BOOLEAN) STORED,
  rubbish     BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 2)::BOOLEAN) STORED,
  masterpiece BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 3)::BOOLEAN) STORED,
  meh         BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 4)::BOOLEAN) STORED,
  arts        BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 5)::BOOLEAN) STORED,
  legal       BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 6)::BOOLEAN) STORED,
  sport       BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 7)::BOOLEAN) STORED,
  politics    BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 8)::BOOLEAN) STORED,

  
  
  CONSTRAINT subject_ck  -- so you can't have conflicting subjects - just for demo purposes
    CHECK                -- a document can't be art and legal at the same time!
    (
      CASE
        WHEN GET_BIT(bitmask, 5) = 1
          THEN GET_BIT(bitmask, 6) = 0 AND GET_BIT(bitmask, 7) = 0 AND GET_BIT(bitmask, 8) = 0

        WHEN GET_BIT(bitmask, 6) = 1
          THEN GET_BIT(bitmask, 5) = 0 AND GET_BIT(bitmask, 7) = 0 AND GET_BIT(bitmask, 8) = 0
  
        WHEN GET_BIT(bitmask, 7) = 1
          THEN GET_BIT(bitmask, 6) = 0 AND GET_BIT(bitmask, 5) = 0 AND GET_BIT(bitmask, 8) = 0

        WHEN GET_BIT(bitmask, 5) = 1
          THEN GET_BIT(bitmask, 8) = 0 AND GET_BIT(bitmask, 7) = 0 AND GET_BIT(bitmask, 5) = 0
      END
    ) 
  
);

Now, the extra 9 booleans add about 12 bytes to the size of the table - if that isn't a problem, then we're good to go! Also, when (I presume shortly - as of writing 2022-09-09) PostgreSQL is enhanced with VIRTUAL columns, there'll be no space overhead at all.

The benefits of doing it like this is that it makes your SQL short and readable - instead of having to a bunch of ugly CASE statements, you'll simply be able to do the following:

INSERT INTO test (data, bitmask) VALUES
('Document 1', '000100000'),
('Document 2', '100000000'),
('Document 3', '101000001');

and also, stuff like this:

CREATE INDEX legal_ix ON test (legal) WHERE legal;

So now, obtaining all of the records is far easier on the eye - much more legible:

SELECT * FROM test;

Result:

t_id    data    bitmask   published visible rubbish masterpiece meh arts    legal   sport   politics
1   Document 1  000100000         f       f       f           t   f    f        f       f         f
2   Document 2  100000000         t       f       f           f   f    f        f       f         f
3   Document 3  101000001         t       f       t           f   f    f        f       f         t

You can also do:

BEGIN TRANSACTION;  -- can't update the other way round or CHECK constraint will fail
                    -- CHECK constraints are not deferrable - can't be 8 & 6 simultaneously
UPDATE test  
  SET bitmask = SET_BIT(bitmask, 8, 0) WHERE data = 'Document 3';
UPDATE test
  SET bitmask = SET_BIT(bitmask, 6, 1) WHERE data = 'Document 3';
COMMIT;

Result:

SELECT t_id, published, legal FROM test;  -- legal has gone from f -> t

There are a few other bits and pieces in the fiddle.

Sidewinder answered 9/9, 2022 at 15:57 Comment(0)
S
2

You can apply the bit string functions directly to a bit string without the need to cast from an integer.

Scirrhus answered 3/4, 2012 at 10:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.