What's the optimal way to store binary flags / boolean values in each database engine?

I've seen some possible approaches (in some database engines some of them are synonyms):

  1. TINYINT(1)
  2. BOOL
  3. BIT(1)
  4. ENUM(0,1)
  5. CHAR(0) NULL
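
To make the options concrete, here is a rough MySQL-style sketch of how I'd declare each one (table and column names are just placeholders I made up):

    -- illustrative only; one column per approach, MySQL syntax
    CREATE TABLE example_flags (
        id          INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
        f_tinyint   TINYINT(1)    NOT NULL DEFAULT 0,    -- option 1
        f_bool      BOOL          NOT NULL DEFAULT 0,    -- option 2 (synonym for TINYINT(1) in MySQL)
        f_bit       BIT(1)        NOT NULL DEFAULT 0,    -- option 3
        f_enum      ENUM('0','1') NOT NULL DEFAULT '0',  -- option 4
        f_charnull  CHAR(0)       NULL                   -- option 5: NULL = off, '' = on
    );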

All major database engines supported by PHP should be covered, but as a reference it would be even better if other engines were noted as well.

I'm asking for a design that is optimized for reading, e.g. SELECTing with the flag field in the WHERE condition, or GROUPing BY the flag. Performance is much more important than storage space (except where size has an impact on performance).

And some more details:

When creating the table I can't know whether it will be sparse (whether most flags are on or off), but I can ALTER the tables later on, so if there is something I can optimize once I know that, it should be noted.

Also, if it makes a difference whether there is only one flag (or a few) per row versus many, that should be noted too.

BTW, I've read somewhere in SO the following:

Using boolean may do the same thing as using tinyint, however it has the advantage of semantically conveying what your intention is, and that's worth something.

Well, in my case it isn't worth anything, because each table is represented by a class in my application, and everything is explicitly defined in the class and well documented.

Facile answered 26/12, 2010 at 22:41 Comment(0)

This answer is for ISO/IEC/ANSI Standard SQL, and includes the better freeware pretend-SQLs.

The first problem is that you have identified two Categories, not one, so they cannot reasonably be compared.

A. Category One

(1), (4), and (5) contain multiple possible values and form one category. All can be easily and effectively used in the WHERE clause. They have the same storage, so neither storage nor read performance is an issue. Therefore the remaining choice is simply based on the actual Datatype suited to the purpose of the column.

ENUM is non-standard; the better or standard method is to use a lookup table; then the values are visible in a table, not hidden, and can be enumerated by any report tool. The read performance of ENUM will suffer a small hit due to the internal processing.
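
As a minimal sketch of the lookup-table approach (table and column names here are illustrative, not prescriptive):

    -- the permitted values are visible and enumerable by any report tool
    CREATE TABLE flag_value (
        flag_code   CHAR(1)     NOT NULL PRIMARY KEY,   -- '0' or '1'
        description VARCHAR(30) NOT NULL
    );

    INSERT INTO flag_value VALUES ('0', 'Off'), ('1', 'On');

    CREATE TABLE customer (
        customer_id INT     NOT NULL PRIMARY KEY,
        is_active   CHAR(1) NOT NULL,
        CONSTRAINT fk_customer_active
            FOREIGN KEY (is_active) REFERENCES flag_value (flag_code)
    );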

B. Category Two

(2) and (3) are Two-Valued elements: True/False; Male/Female; Dead/Alive. That category is different to Category One. Its treatment, both in your data model and on each platform, is different. BOOLEAN is just a synonym for BIT; they are the same thing. Legally (SQL-wise) they are handled the same by all SQL-compliant platforms, and there is no problem using them in the WHERE clause.

The difference in performance depends on the platform. Sybase and DB2 pack up to 8 BITs into one byte (not that storage matters here), and map the power-of-two on the fly, so performance is really good. Oracle does different things in each version, and I have seen modellers use CHAR(1) instead of BIT, to overcome performance problems. MS was fine up to 2005 but they have broken it with 2008, as in the results are unpredictable; so the short answer may be to implement it as CHAR(1).
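
As a rough sketch only (syntax varies by platform, and the names are illustrative):

    -- Category Two column, used directly in the WHERE clause
    CREATE TABLE person (
        person_id   INT  NOT NULL PRIMARY KEY,
        is_deceased BIT  NOT NULL            -- or BOOLEAN where supported
    );

    SELECT person_id
        FROM person
        WHERE is_deceased = 1;

    -- CHAR(1) fallback on platforms where BIT handling performs poorly:
    --     is_deceased CHAR(1) NOT NULL CHECK (is_deceased IN ('Y', 'N'))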

Of course, the assumption is that you do not do silly things such as packing 8 separate columns into one TINYINT. Not only is that a serious Normalisation error, it is a nightmare for coders. Keep each column discrete and of the correct Datatype.

C. Multiple Indicator & Nullable Columns

This has nothing to do with, and is independent of, (A) and (B). What the column's correct Datatype is, is separate from how many you have and whether it is Nullable. Nullable means (usually) the column is optional. Essentially you have not completed the modelling or Normalisation exercise: the Functional Dependencies are ambiguous. If you complete the Normalisation exercise, there will be no Nullable columns, no optional columns; either they clearly exist for a particular relation, or they do not exist. That means using the ordinary Relational structure of Supertype-Subtypes.

Sure, that means more tables, but no Nulls. Enterprise DBMS have no problem with more tables or more joins; that is what they are optimised for. Normalised databases perform much better than unnormalised or denormalised ones, and they can be extended without "re-factoring". You can ease the use by supplying a View for each Subtype.
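
A minimal sketch of the Supertype-Subtype idea (the table and column names are assumptions for illustration only):

    -- Supertype holds the common columns; each Subtype exists only where
    -- its facts actually apply, so there are no Nullable columns
    CREATE TABLE party (
        party_id   INT         NOT NULL PRIMARY KEY,
        party_name VARCHAR(60) NOT NULL
    );

    CREATE TABLE party_person (          -- Subtype: only persons have these facts
        party_id   INT  NOT NULL PRIMARY KEY,
        birth_date DATE NOT NULL,
        CONSTRAINT fk_person_party FOREIGN KEY (party_id) REFERENCES party (party_id)
    );

    CREATE TABLE party_organisation (    -- Subtype: only organisations have these facts
        party_id   INT      NOT NULL PRIMARY KEY,
        tax_number CHAR(11) NOT NULL,
        CONSTRAINT fk_org_party FOREIGN KEY (party_id) REFERENCES party (party_id)
    );

    -- one View per Subtype eases the use
    CREATE VIEW person AS
        SELECT p.party_id, p.party_name, pp.birth_date
            FROM party p
            JOIN party_person pp ON pp.party_id = p.party_id;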

If you want more information on this subject, look at this question/answer. If you need help with the modelling, please ask a new question. At your level of questioning, I would advise that you stick with 5NF.

D. Performance of Nulls

Separately, if performance is important to you, then exclude Nulls. Each Nullable column is stored as variable length; that requires additional processing for each row/column. The enterprise databases use "deferred" handling for such rows, to allow the logging, etc, to move through the queues without impeding the fixed-length rows. In particular, never use variable-length columns (that includes Nullable columns) in an Index: that requires unpacking on every access.
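
For instance, a sketch of an indexable flag under those guidelines (names are illustrative):

    -- fixed-length, NOT NULL flag column is safe to index
    CREATE TABLE order_header (
        order_id   INT     NOT NULL PRIMARY KEY,
        is_shipped CHAR(1) NOT NULL DEFAULT 'N'
    );

    CREATE INDEX ix_order_shipped ON order_header (is_shipped, order_id);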

E. Poll

Finally, I do not see the point in this question being a poll. It is fair enough that you will get technical answers, and even opinions, but polls are for popularity contests, and the technical ability of responders at SO covers a very wide range, so the most popular answers and the most technically correct answers can be at two different ends of the spectrum.

Swellfish answered 27/12, 2010 at 2:32 Comment(2)
Could you include a link or further info on how "MS was fine up to 2005 but they have broken it with 2008, as in the results are unpredictable", and whether it is broken in 2K8R2? – Ekaterinburg
@RC. I don't have links, I have experience; Wiki hasn't heard about it yet. The results in the specific context above, and some others (not all contexts), are unpredictable if you had code that performed well under 2005. I have posted several other specifics: removal of overflow pages, hurting all performance on Clustered indices, etc. Feel free to read them. Not fixed in Rev 2. There is no chance of these elements being fixed until Rev 4 at least. It took MS over 3 years to fix 2005. – Swellfish

I know this is not the answer you want, but the difference is really negligible in all but the most extreme special cases. And in each such specific case, simply switching the datatype won't be enough to fix a performance problem.

For example, here are some alternatives that will outperform any datatype changes by a large factor. Each carries with it a downside of course.

If you have 200 optional flags and you query for at most 1-2 at a time for lots of rows, you would get better performance by having each flag in its own table. If the data is really sparse this gets even better.
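
A rough sketch of what I mean (MySQL-flavoured; the names are made up, and it assumes an existing customer table keyed by customer_id):

    -- presence of a row means the flag is on; absence means off
    CREATE TABLE customer_is_vip (
        customer_id INT NOT NULL PRIMARY KEY
        -- in practice, also a FOREIGN KEY back to customer (customer_id)
    );

    -- "which customers have the flag set?" becomes a join or EXISTS test
    SELECT c.customer_id
        FROM customer c
        WHERE EXISTS (SELECT 1 FROM customer_is_vip v
                      WHERE v.customer_id = c.customer_id);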

If you have 200 mandatory flags and you only perform single record fetches, you should put them in the same table.

If you have a small set of flags, you could pack them into one column using a bitmask, which is efficient storage-wise, but you won't be able to (easily) query individual flags. Of course, this doesn't work when flags can be NULL...
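
For example, a bitmask sketch (MySQL-flavoured; bit positions and names are made up):

    -- flags packed into one integer column (bit 0 = active, bit 1 = verified, ...)
    CREATE TABLE account (
        account_id INT          NOT NULL PRIMARY KEY,
        flags      INT UNSIGNED NOT NULL DEFAULT 0
    );

    -- set the "verified" bit (value 2)
    UPDATE account SET flags = flags | 2 WHERE account_id = 42;

    -- find rows with that bit on; note an ordinary index on flags can't be used here
    SELECT account_id FROM account WHERE (flags & 2) = 2;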

Or you could get creative and use a "junk dimension" concept, in which you create a separate table with all 200 boolean flags represented as columns. Create one row for each distinct combination of flag values. Each row gets an autoincrement primary key, which you reference in the master record. Voilà, the master table now contains 1 int instead of 200 columns. Hackers' heaven, DBA nightmare.
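
A sketch of that junk-dimension layout (MySQL-flavoured; only three flags shown, and all names are illustrative):

    -- one row per distinct combination of flag values
    CREATE TABLE flag_combo (
        flag_combo_id INT        NOT NULL AUTO_INCREMENT PRIMARY KEY,
        is_active     TINYINT(1) NOT NULL,
        is_verified   TINYINT(1) NOT NULL,
        is_blocked    TINYINT(1) NOT NULL
        -- ... one column per flag, up to all 200
    );

    -- the master table now carries a single integer
    CREATE TABLE master_record (
        master_id     INT NOT NULL PRIMARY KEY,
        flag_combo_id INT NOT NULL,
        CONSTRAINT fk_master_combo
            FOREIGN KEY (flag_combo_id) REFERENCES flag_combo (flag_combo_id)
    );

    -- querying a flag becomes a join
    SELECT m.master_id
        FROM master_record m
        JOIN flag_combo f ON f.flag_combo_id = m.flag_combo_id
        WHERE f.is_verified = 1;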

The point I'm trying to make is that even though it is interesting to argue over which is "the best", there are other concerns that are of much greater importance (like the comment you quoted). Simply because when you encounter a real performance issue, the datatype will neither be the problem nor the solution.

Assort answered 27/12, 2010 at 0:8 Comment(0)

Any of the above is fine. I have a personal preference for BOOL, where it is properly supported, because it best conveys your intent, but I would avoid ENUM(0,1).

The first problem with ENUM is that it requires its value to be a string. 0 and 1 look like numbers, so programmers have a tendency to send it a number.

The second problem with ENUM is that if you send it a wrong value it defaults to the first enumeration, and in some databases it won't even indicate an error (I'm looking at you, MySQL). This makes the first problem much worse, since if you accidentally send it 1 instead of "1" it will store the value "0" -- very counter-intuitive!
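
A sketch of the MySQL behaviour I'm describing (assuming a non-strict SQL mode; the table name is illustrative):

    CREATE TABLE enum_demo (
        flag ENUM('0','1') NOT NULL
    );

    INSERT INTO enum_demo VALUES ('1');  -- stores '1', as intended
    INSERT INTO enum_demo VALUES (1);    -- the number 1 is treated as the enumeration INDEX, so this stores '0'
    INSERT INTO enum_demo VALUES (2);    -- index 2 stores '1'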

I don't think this affects all database engines (I don't know, I haven't tried them all), but it affects enough of them that I consider avoiding it good practice.

Haem answered 27/12, 2010 at 1:11 Comment(0)
