What is Normalisation (or Normalization)? [closed]
Why do database guys go on about normalisation?

What is it? How does it help?

Does it apply to anything outside of databases?

Slim answered 29/10, 2008 at 13:1 Comment(2)
The only goal of normalization is to avoid redundancy, redundancy, and ONLY REDUNDANCY. Go through this SQL Server interview video and check out the normalization questions: youtu.be/SEdAF8mSKS4Combes
That video is not helpful & is full of wrong things. One of the 1st things it says is wrong: that replacing values by ids plus a lookup table is normalization. The definitions of higher NFs wrongly refer to PKs; but correct definitions use CKs. It misrepresents what a partial FD is. It uses the wrong word for transitive FDs. It talks about dependency on a prime/CK column when correct is dependency on CKs. Etc. PS That video is affiliated with the user who commented the link to it.Tomaso
Normalization is basically designing a database schema so that duplicate and redundant data are avoided. If the same information is repeated in multiple places in the database, there is a risk that it is updated in one place but not the other, leading to data corruption.

There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of a specific problem.

First normal form (1NF) is special because it is not about redundancy. 1NF disallows nested tables, more specifically columns which allow tables as values. Nested tables are not supported by SQL in the first place, so most normal relational databases will be in 1NF by default. So we can ignore 1NF for the rest of the discussion.

The normal forms 2NF to 5NF all concern scenarios where the same information is represented multiple times in the same table.

For example consider a database of moons and planets:

Moon(PK) | Planet  | Planet kind
------------------------------
Phobos   | Mars    | Rock
Deimos   | Mars    | Rock
Io       | Jupiter | Gas
Europa   | Jupiter | Gas
Ganymede | Jupiter | Gas

The redundancy is obvious: the fact that Jupiter is a gas planet is repeated three times, once for each moon. This is a waste of space, but much more seriously, this schema makes inconsistent information possible:

Moon(PK) | Planet  | Planet kind
------------------------------
Phobos   | Mars    | Rock
Deimos   | Mars    | Rock
Io       | Jupiter | Gas
Europa   | Jupiter | Rock <-- Oh no!
Ganymede | Jupiter | Gas

A query can now give inconsistent results which can have disastrous consequences.

(Of course a database cannot protect against wrong information being entered. But it can protect against inconsistent information, which is just as serious a problem.)

The normalized design would split the table into two tables:

Moon(PK) | Planet(FK)     Planet(PK) | Planet kind
---------------------     ------------------------
Phobos   | Mars           Mars       | Rock
Deimos   | Mars           Jupiter    | Gas
Io       | Jupiter 
Europa   | Jupiter 
Ganymede | Jupiter 

Now no fact is repeated multiple times, so there is no possibility of inconsistent data. (It may look like there still is some repetition since the planet names are repeated, but repeating primary key values as foreign keys does not violate normalization since it does not introduce a risk of inconsistent data.)

Rule of thumb: If the same information can be represented with fewer individual cell values, not counting foreign keys, then the table should be normalized by splitting it into more tables. For example, the first table has 15 individual values, while the two tables have only 9 individual (non-FK) values. This means we eliminate 6 redundant values.

We know the same information is still there, since we can write a join query that returns the same data as the original un-normalized table.
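The split design above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not part of the original answer; the table and column names are chosen to match the example. The join at the end reconstructs the original un-normalized table, while each planet's kind is stored exactly once, so it can no longer be contradicted.

```python
import sqlite3

# In-memory database with the normalized moon/planet schema.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE planet (name TEXT PRIMARY KEY, kind TEXT NOT NULL)")
con.execute("""CREATE TABLE moon (
    name   TEXT PRIMARY KEY,
    planet TEXT NOT NULL REFERENCES planet(name))""")

# Each fact is stored once: two planets, five moons.
con.executemany("INSERT INTO planet VALUES (?, ?)",
                [("Mars", "Rock"), ("Jupiter", "Gas")])
con.executemany("INSERT INTO moon VALUES (?, ?)",
                [("Phobos", "Mars"), ("Deimos", "Mars"),
                 ("Io", "Jupiter"), ("Europa", "Jupiter"),
                 ("Ganymede", "Jupiter")])

# A join recovers the original un-normalized table.
rows = con.execute("""SELECT m.name, p.name, p.kind
                      FROM moon m JOIN planet p ON m.planet = p.name
                      ORDER BY m.name""").fetchall()
for row in rows:
    print(row)
```

Because "Jupiter is a gas planet" now exists in exactly one row, the inconsistent "Europa | Jupiter | Rock" state from the earlier table simply cannot be represented.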

How do I avoid such problems? Normalization problems are easily avoided by giving a bit of thought to the conceptual model, for example by drawing an entity-relationship diagram. Planets and moons have a one-to-many relationship, which means they should be represented in two different tables with a foreign-key association. Normalization issues happen when multiple entities with a one-to-many or many-to-many relationship are represented in the same table row.

Is normalization important? Yes, it is very important. With a database that has normalization errors, you run the risk of getting invalid or corrupt data into the database. Since data "lives forever" it is very hard to get rid of corrupt data once it has entered the database.

But I don't really think it is important to distinguish between the different normal forms from 2NF to 5NF. It is typically pretty obvious when a schema contains redundancies - whether it is 3NF or 5NF that is violated is less important, as long as the problem is fixed.

(There are also some additional normal forms like DKNF and 6NF, which are only relevant for special-purpose systems like data warehouses.)

Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse. They make it sound like normalization is a complicated mathematical process. However, normalization is basically just common sense, and you will find that if you design a database schema using common sense, it will typically be fully normalized.

There are a number of misconceptions around normalization:

  • some believe that normalized databases are slower, and that denormalization improves performance. This is only true in very special cases, however. Typically a normalized database is also the fastest.

  • sometimes normalization is described as a gradual design process where you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problems solved by normal forms above 3NF are pretty rare in the first place, so chances are that your schema is already in 5NF.

Does it apply to anything outside of databases? Not directly, no. The principles of normalization are quite specific to relational databases. However, the general underlying theme - that you shouldn't have duplicate data if the different instances can get out of sync - can be applied broadly. This is basically the DRY principle.

Activator answered 29/10, 2008 at 13:14 Comment(9)
The example you gave for first normal form isn't exactly correct. I always remember the first three normal forms by the terms repeating, redundant, non-dependent. Repeating data refers to when novice database developers write table defs that include columns like DogName1, DogName2, DogName3, etc.Knute
How come normalization is not needed in object-oriented programming - but only when it comes to databases? I think there already is a normalization process built into the core of what object-oriented programming is all about - is it not?Beck
@Beck No, not at all. You can always map the state of an OO design trivially to a "relational" database/design with tables--that's what ORM is--but the database/design you get is a relational representation of the OO design, not a relational representation of the business, and not only is the relational representation of the OO design clearly subject to update anomalies & redundancy that normalization manages but the OO methods have to enforce the appropriate (undocumented) (complex) constraints ("representation invariant") by hand.Tomaso
@Lealo: The principle does apply to OOP to the extent that you should never have the same information (in mutable form) in two different objects, since they may then become out of sync.Activator
@Bill: DogName1, DogName2, DogName3, etc. might be a bad design, but it doesn't violate first normal form.Activator
@Activator Actually it does. Simply turning a multi-value attribute into multiple fields does not make it any less a violation of 1NF.Knute
@Bill: Yes it does. 1NF means that no column has sets or relations as allowed values. Multiple fields with single values does not violate 1NF.Activator
@Activator We will have to agree to disagree on this. Since SQL doesn't support multi-value fields, 1NF has to be interpreted as indicating don't turn multi-value attributes into multiple fields.Knute
@Bill: 1NF is defined in E. F. Codd's paper "A Relational Model of Data for Large Shared Data Banks". It is unambiguously defined in terms of domains, i.e. the allowed data types for columns. Nested tables are disallowed because supporting them would complicate a query language needlessly. This paper inspired the design of the SQL language. The normal forms have precise definitions; it is not really something which can be "interpreted" to mean something entirely different.Activator
Most importantly, it serves to remove duplication from the database records. For example, if you have more than one place (table) where a person's name could come up, you move the name to a separate table and reference it everywhere else. This way, if you need to change the person's name later, you only have to change it in one place.
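The "change it in one place" point can be sketched with a small sqlite3 example. The `person` and `orders` tables and all names here are invented purely for illustration; the point is that a rename touches a single row, and every referencing row sees the new value through the join.

```python
import sqlite3

# Hypothetical schema: the name lives only in `person`,
# and other tables reference it by id.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE orders (
    order_id  INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(id))""")
con.execute("INSERT INTO person VALUES (1, 'Ann Smith')")
con.executemany("INSERT INTO orders VALUES (?, 1)", [(10,), (11,)])

# Renaming updates exactly one row...
con.execute("UPDATE person SET name = 'Ann Jones' WHERE id = 1")

# ...and every order automatically sees the new name via the join.
names = con.execute("""SELECT DISTINCT p.name
                       FROM orders o JOIN person p ON o.person_id = p.id""").fetchall()
print(names)
```

Had the name been copied into each order row, the same rename would require finding and updating every copy, with the risk of missing one.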

It is crucial for proper database design, and in theory you should use it as much as possible to preserve your data integrity. However, when retrieving information from many tables you lose some performance, and that's why you sometimes see denormalised (also called flattened) database tables used in performance-critical applications.

My advice is to start with a good degree of normalisation and only de-normalise when really needed.

P.S. Also check this article: http://en.wikipedia.org/wiki/Database_normalization to read more on the subject and about the so-called normal forms.

Unsuspecting answered 29/10, 2008 at 13:7 Comment(3)
You'd also be quite surprised how little denormalisation is really needed in transactional apps. In one monster application I did the data model for, a schema with 560 tables had only 4 items of denormalised data.Boxhaul
It prevents "update anomalies". It does this by eliminating certain kinds of duplication.Locris
"My advise is to start with good degree of normalisation and only do de-normalisation when really needed". This one advice is a very bad one! I still did not see any proper illustration of this "pseudo-theory". Minus 1.Lemures
Normalization is a procedure used to eliminate redundancy and functional dependencies between columns in a table.

There exist several normal forms, generally indicated by a number. A higher number means fewer redundancies and dependencies. Any SQL table is in 1NF (first normal form), pretty much by definition. Normalizing means changing the schema (often partitioning the tables) in a reversible way, giving a model which is functionally identical, except with less redundancy and fewer dependencies.

Redundancy and dependency of data are undesirable because they can lead to inconsistencies when modifying the data.

Rural answered 29/10, 2008 at 13:13 Comment(0)
It is intended to reduce redundancy of data.

For a more formal discussion, see the Wikipedia http://en.wikipedia.org/wiki/Database_normalization

I'll give a somewhat simplistic example.

Assume an organization's database that contains family members:

id   name        address
214  Mr. Chris   123 Main St.
317  Mrs. Chris  123 Main St.

could be normalized as

id   name        familyID
214  Mr. Chris   27
317  Mrs. Chris  27

and a family table

id   address
27   123 Main St.

Near-complete normalization (BCNF) is usually not used in production, but is an intermediate step. Once you've put the database in BCNF, the next step is usually to de-normalize it in a logical way to speed up queries and reduce the complexity of certain common inserts. However, you can't do this well without properly normalizing it first.

The idea is that the redundant information is reduced to a single entry. This is particularly useful in fields like addresses, where Mr. Chris submits his address as Unit-7 123 Main St. and Mrs. Chris lists Suite-7 123 Main Street, which would show up in the original table as two distinct addresses.

Typically, the technique used is to find repeated elements, and isolate those fields into another table with unique ids and to replace the repeated elements with a primary key referencing the new table.
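The family example above can be sketched with sqlite3. This is an illustrative reconstruction of the answer's two tables (the `family`/`member` names are made up, mirroring the example data); the repeated address is isolated into its own table and replaced by a foreign key, so both members always agree on it.

```python
import sqlite3

# Normalized version of the family example: the address is stored
# once in `family` and referenced by id from `member`.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE family (id INTEGER PRIMARY KEY, address TEXT)")
con.execute("""CREATE TABLE member (
    id        INTEGER PRIMARY KEY,
    name      TEXT,
    family_id INTEGER REFERENCES family(id))""")

con.execute("INSERT INTO family VALUES (27, '123 Main St.')")
con.executemany("INSERT INTO member VALUES (?, ?, 27)",
                [(214, "Mr. Chris"), (317, "Mrs. Chris")])

# Joining back recovers the original one-table view, but the address
# now exists in exactly one row, so the two members cannot disagree.
rows = con.execute("""SELECT m.name, f.address
                      FROM member m JOIN family f ON m.family_id = f.id
                      ORDER BY m.id""").fetchall()
for row in rows:
    print(row)
```

The "Unit-7" vs "Suite-7" discrepancy described above becomes impossible here: there is only one address cell to edit.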

Slip answered 29/10, 2008 at 13:14 Comment(2)
BCNF is not "perfect". Higher normal forms exist, up to 6NF, where all your tables are just a key and a data value. It's rarely ever used, thoughRural
I disagree that BCNF is rarely used and typically denormalized. In fact your normalized example is already in BCNF, and if you denormalize it you would be back to square one.Activator
Quoting C. J. Date: Theory IS practical.

Departures from normalization will result in certain anomalies in your database.

Departures from First Normal Form will cause access anomalies, meaning that you have to decompose and scan individual values in order to find what you are looking for. For example, if one of the values is the string "Ford, Cadillac" as given by an earlier response, and you are looking for all the occurrences of "Ford", you are going to have to break open the string and look at the substrings. This, to some extent, defeats the purpose of storing the data in a relational database.
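The access anomaly can be shown in a few lines of Python (the owner/car data is invented for illustration). With the multi-valued string you must parse every cell to answer a simple question; with one atomic value per row, a plain equality filter suffices.

```python
# Non-1NF: a "cars" cell holds a comma-separated list.
unnormalized = [
    ("Smith", "Ford, Cadillac"),
    ("Jones", "Ford"),
]

# Finding Ford owners means breaking each string open and
# scanning the substrings - the access anomaly.
ford_owners = [owner for owner, cars in unnormalized
               if "Ford" in [c.strip() for c in cars.split(",")]]

# 1NF: one atomic car value per row, so no parsing is needed;
# a simple equality test answers the same question.
normalized = [("Smith", "Ford"), ("Smith", "Cadillac"), ("Jones", "Ford")]
ford_owners_1nf = sorted({owner for owner, car in normalized if car == "Ford"})

print(ford_owners)
print(ford_owners_1nf)
```

In SQL terms, the non-1NF design forces `LIKE '%Ford%'`-style string matching (which also falsely matches values like "Fordson"), while the 1NF design allows a plain indexed `WHERE car = 'Ford'`.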

The definition of First Normal Form has changed since 1970, but those differences need not concern you for now. If you design your SQL tables using the relational data model, your tables will automatically be in 1NF.

Departures from Second Normal Form and beyond will cause update anomalies, because the same fact is stored in more than one place. These problems make it impossible to store some facts without storing other facts that may not exist, and therefore have to be invented. Or when the facts change, you may have to locate all the places where a fact is stored and update all those places, lest you end up with a database that contradicts itself. And, when you go to delete a row from the database, you may find that if you do, you are deleting the only place where a fact that is still needed is stored.

These are logical problems, not performance problems or space problems. Sometimes you can get around these update anomalies by careful programming. Sometimes (often) it's better to prevent the problems in the first place by adhering to normal forms.

Notwithstanding the value in what's already been said, it should be mentioned that normalization is a bottom-up approach, not a top-down approach. If you follow certain methodologies in your analysis of the data and in your initial design, you can be guaranteed that the design will conform to 3NF at the very least. In many cases, the design will be fully normalized.

Where you may really want to apply the concepts taught under normalization is when you are given legacy data, out of a legacy database or out of files made up of records, and the data was designed in complete ignorance of normal forms and the consequences of departing from them. In these cases you may need to discover the departures from normalization, and correct the design.

Warning: normalization is often taught with religious overtones, as if every departure from full normalization were a sin, an offense against Codd (little pun there). Don't buy that. When you really, really learn database design, you'll not only know how to follow the rules, but also know when it's safe to break them.

Raybin answered 31/10, 2008 at 13:25 Comment(0)
As Martin Kleppmann says in his book Designing Data-Intensive Applications:

Literature on the relational model distinguishes several different normal forms, but the distinctions are of little practical interest. As a rule of thumb, if you’re duplicating values that could be stored in just one place, the schema is not normalized.

Jairia answered 20/4, 2020 at 21:9 Comment(0)
Normalization is one of the basic concepts. It means that two things do not influence each other.

In databases it specifically means that two (or more) tables do not contain the same data, i.e. do not have any redundancy.

At first sight that is really good, because your chances of synchronization problems are close to zero, you always know where your data is, etc. But your number of tables will probably grow, and you will have problems crossing the data and getting summary results.

So, in the end you will finish with a database design that is not purely normalized, with some redundancy (it will be at one of the possible levels of normalization).

Shier answered 29/10, 2008 at 13:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.