Why do database guys go on about normalisation?
What is it? How does it help?
Does it apply to anything outside of databases?
Normalization basically means designing a database schema so that duplicate and redundant data are avoided. If the same information is repeated in multiple places in the database, there is a risk that it is updated in one place but not another, leading to data corruption.
There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of a specific problem.
First normal form (1NF) is special because it is not about redundancy. 1NF disallows nested tables, more specifically columns that allow tables as values. Nested tables are not supported by SQL in the first place, so most normal relational databases will be in 1NF by default, and we can ignore 1NF for the rest of this discussion.
The normal forms 2NF to 5NF all concern scenarios where the same information is represented multiple times in the same table.
For example, consider a database of moons and planets:
Moon(PK)  | Planet  | Planet kind
---------------------------------
Phobos    | Mars    | Rock
Deimos    | Mars    | Rock
Io        | Jupiter | Gas
Europa    | Jupiter | Gas
Ganymede  | Jupiter | Gas
The redundancy is obvious: the fact that Jupiter is a gas planet is repeated three times, once for each moon. This is a waste of space, but much more seriously, this schema makes inconsistent information possible:
Moon(PK)  | Planet  | Planet kind
---------------------------------
Phobos    | Mars    | Rock
Deimos    | Mars    | Rock
Io        | Jupiter | Gas
Europa    | Jupiter | Rock   <-- Oh no!
Ganymede  | Jupiter | Gas
A query can now give inconsistent results which can have disastrous consequences.
(Of course a database cannot protect against wrong information being entered. But it can protect against inconsistent information, which is just as serious a problem.)
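To make the risk concrete, here is a minimal sketch in Python with sqlite3 (the table and column names are made up to mirror the table above, not taken from any real schema):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE moon (
        name        TEXT PRIMARY KEY,
        planet      TEXT NOT NULL,
        planet_kind TEXT NOT NULL   -- repeated on every row: this is the redundancy
    )
""")
con.executemany("INSERT INTO moon VALUES (?, ?, ?)", [
    ("Phobos",   "Mars",    "Rock"),
    ("Deimos",   "Mars",    "Rock"),
    ("Io",       "Jupiter", "Gas"),
    ("Europa",   "Jupiter", "Rock"),   # contradicts the Io row, and nothing stops it
    ("Ganymede", "Jupiter", "Gas"),
])

# The same question now has two answers, depending on which row you read
# (result order may vary):
print(con.execute(
    "SELECT DISTINCT planet_kind FROM moon WHERE planet = 'Jupiter'").fetchall())
# -> [('Gas',), ('Rock',)]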
The normalized design would split the table into two tables:
Moon(PK)  | Planet(FK)      Planet(PK) | Planet kind
---------------------      ------------------------
Phobos    | Mars            Mars       | Rock
Deimos    | Mars            Jupiter    | Gas
Io        | Jupiter
Europa    | Jupiter
Ganymede  | Jupiter
Now no fact is repeated multiple times, so there is no possibility of inconsistent data. (It may look like there still is some repetition since the planet names are repeated, but repeating primary key values as foreign keys does not violate normalization since it does not introduce a risk of inconsistent data.)
Rule of thumb: If the same information can be represented with fewer individual cell values, not counting foreign keys, then the table should be normalized by splitting it into more tables. For example, the first table holds 15 individual values, while the two tables hold only 9 individual (non-FK) values, so the normalized design eliminates 6 redundant values.
We know the same information is still there, since we can write a join query that returns the same data as the original un-normalized table.
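As a sketch of both points, assuming the same made-up names as before (Python with sqlite3): the normalized schema stores each fact exactly once, and a join reconstructs the original wide table.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE planet (
        name TEXT PRIMARY KEY,
        kind TEXT NOT NULL                 -- each planet's kind is stored exactly once
    );
    CREATE TABLE moon (
        name   TEXT PRIMARY KEY,
        planet TEXT NOT NULL REFERENCES planet(name)  -- FK: an association, not a copied fact
    );
    INSERT INTO planet VALUES ('Mars', 'Rock'), ('Jupiter', 'Gas');
    INSERT INTO moon   VALUES ('Phobos', 'Mars'), ('Deimos', 'Mars'),
                              ('Io', 'Jupiter'), ('Europa', 'Jupiter'),
                              ('Ganymede', 'Jupiter');
""")

# The join returns the same rows as the original un-normalized table:
for row in con.execute("""
        SELECT moon.name, moon.planet, planet.kind
        FROM moon JOIN planet ON moon.planet = planet.name"""):
    print(row)

Note that Europa can no longer be recorded as orbiting a rock planet without changing the single row that describes Jupiter.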
How do I avoid such problems? Normalization problems are easily avoided by giving a bit of thought to the conceptual model, for example by drawing an entity-relationship diagram. Planets and moons have a one-to-many relationship, which means they should be represented in two different tables with a foreign-key association. Normalization issues happen when multiple entities with a one-to-many or many-to-many relationship are represented in the same table row.
Is normalization important? Yes, it is very important. If your database has normalization errors, you run the risk of getting invalid or corrupt data into it. Since data "lives forever", it is very hard to get rid of corrupt data once it has entered the database.
But I don't really think it is important to distinguish between the different normal forms from 2NF to 5NF. It is typically pretty obvious when a schema contains redundancies; whether it is 3NF or 5NF that is violated matters less, as long as the problem is fixed.
(There are also some additional normal forms, like DKNF and 6NF, which are only relevant for special-purpose systems like data warehouses.)
Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse, and they make it sound like normalization is a complicated mathematical process. However, normalization is basically just common sense, and you will find that if you design a database schema using common sense it will typically be fully normalized.
There are a number of misconceptions around normalization:
Some believe that normalized databases are slower and that denormalization improves performance. However, this is only true in very special cases; typically a normalized database is also the fastest.
Sometimes normalization is described as a gradual design process where you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problems solved by normal forms above 3NF are pretty rare in the first place, so chances are that your schema is already in 5NF.
Does it apply to anything outside of databases? Not directly, no. The principles of normalization are quite specific to relational databases. However, the general underlying theme, that you shouldn't have duplicate data if the different instances can get out of sync, can be applied broadly. This is basically the DRY principle.
Most importantly, it serves to remove duplication from the database records. For example, if you have more than one place (table) where the name of a person could come up, you move the name to a separate table and reference it everywhere else. This way, if you need to change the person's name later, you only have to change it in one place.
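A minimal sketch of that idea in Python with sqlite3 (the person, orders and invoices tables are invented for the illustration):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, person_id INTEGER REFERENCES person(id));
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, person_id INTEGER REFERENCES person(id));
    INSERT INTO person   VALUES (1, 'Chris');
    INSERT INTO orders   VALUES (10, 1);
    INSERT INTO invoices VALUES (20, 1);
""")

# The name lives in exactly one row, so a rename is a single UPDATE;
# every table that references person(id) picks up the change automatically.
con.execute("UPDATE person SET name = 'Christopher' WHERE id = 1")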
It is crucial for proper database design, and in theory you should use it as much as possible to preserve your data integrity. However, when retrieving information from many tables you lose some performance, and that is why you will sometimes see denormalised (also called flattened) tables used in performance-critical applications.
My advice is to start with a good degree of normalisation and only denormalise when really needed.
P.S. Also check this article to read more on the subject and about the so-called normal forms: http://en.wikipedia.org/wiki/Database_normalization
Normalization is a procedure used to eliminate redundancy and functional dependencies between the columns in a table.
There exist several normal forms, generally indicated by a number. A higher number means fewer redundancies and dependencies. Any SQL table is in 1NF (first normal form), pretty much by definition. Normalizing means changing the schema (often by partitioning the tables) in a reversible way, giving a model which is functionally identical, except with less redundancy and fewer dependencies.
Redundancy and dependency of data are undesirable because they can lead to inconsistencies when modifying the data.
It is intended to reduce redundancy of data.
For a more formal discussion, see Wikipedia: http://en.wikipedia.org/wiki/Database_normalization
I'll give a somewhat simplistic example.
Assume an organization's database that often contains several members of the same family:
id   name        address
214  Mr. Chris   123 Main St.
317  Mrs. Chris  123 Main St.
could be normalized as
id   name        familyID
214  Mr. Chris   27
317  Mrs. Chris  27
and a family table
id   address
27   123 Main St.
Near-complete normalization (BCNF) is usually not used in production, but rather as an intermediate step. Once you've put the database in BCNF, the next step is usually to denormalize it in a logical way to speed up queries and reduce the complexity of certain common inserts. However, you can't do this well without properly normalizing first.
The idea is that the redundant information is reduced to a single entry. This is particularly useful in fields like addresses, where Mr. Chris submits his address as Unit-7 123 Main St. and Mrs. Chris lists Suite-7 123 Main Street, which would show up in the original table as two distinct addresses.
Typically, the technique used is to find repeated elements, isolate those fields into another table with unique ids, and replace the repeated elements with a primary key referencing the new table.
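A sketch of that migration in Python with sqlite3, using the family example above (the member and family tables and their column names are assumed for the illustration, and DROP COLUMN needs a reasonably recent SQLite):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    INSERT INTO member VALUES (214, 'Mr. Chris',  '123 Main St.');
    INSERT INTO member VALUES (317, 'Mrs. Chris', '123 Main St.');

    -- 1. Isolate the repeated field in a new table with unique ids.
    CREATE TABLE family (id INTEGER PRIMARY KEY, address TEXT UNIQUE);
    INSERT INTO family (address) SELECT DISTINCT address FROM member;

    -- 2. Replace the repeated element with a key referencing the new table.
    ALTER TABLE member ADD COLUMN family_id INTEGER REFERENCES family(id);
    UPDATE member SET family_id =
        (SELECT id FROM family WHERE family.address = member.address);
    ALTER TABLE member DROP COLUMN address;   -- needs SQLite 3.35+
""")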
Quoting CJ Date: Theory IS practical.
Departures from normalization will result in certain anomalies in your database.
Departures from First Normal Form will cause access anomalies, meaning that you have to decompose and scan individual values in order to find what you are looking for. For example, if one of the values is the string "Ford, Cadillac", as given in an earlier response, and you are looking for all the occurrences of "Ford", you are going to have to break open the string and look at the substrings. This, to some extent, defeats the purpose of storing the data in a relational database.
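A short sketch of that access anomaly in Python with sqlite3 (the person_cars table, the people and the 'Fordson' row are invented to show the problem):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person_cars (person TEXT, cars TEXT);   -- a list packed into one cell
    INSERT INTO person_cars VALUES ('Alice', 'Ford, Cadillac');
    INSERT INTO person_cars VALUES ('Bob',   'Fordson');  -- not what we will search for
""")

# Substring matching is the only way in, and it can match the wrong things:
print(con.execute(
    "SELECT person FROM person_cars WHERE cars LIKE '%Ford%'").fetchall())
# -> [('Alice',), ('Bob',)]   -- 'Fordson' matched too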
The definition of First Normal Form has changed since 1970, but those differences need not concern you for now. If you design your SQL tables using the relational data model, your tables will automatically be in 1NF.
Departures from Second Normal Form and beyond will cause update anomalies, because the same fact is stored in more than one place. These problems make it impossible to store some facts without storing other facts that may not exist and therefore have to be invented. Or when the facts change, you may have to locate all the places where a fact is stored and update each of them, lest you end up with a database that contradicts itself. And when you go to delete a row from the database, you may find that if you do, you are deleting the only place where a fact that is still needed is stored.
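To make the deletion anomaly concrete, here is a sketch reusing the single-table moon design from the first answer (Python with sqlite3, assumed names):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE moon (name TEXT PRIMARY KEY, planet TEXT, planet_kind TEXT);
    INSERT INTO moon VALUES ('Phobos', 'Mars', 'Rock');
    INSERT INTO moon VALUES ('Deimos', 'Mars', 'Rock');
""")

# Deleting the last moons of Mars also deletes the only rows recording
# that Mars is a rock planet -- a fact we may still need.
con.execute("DELETE FROM moon WHERE planet = 'Mars'")
print(con.execute("SELECT * FROM moon").fetchall())   # -> []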
These are logical problems, not performance problems or space problems. Sometimes you can get around these update anomalies by careful programming. Sometimes (often) it's better to prevent the problems in the first place by adhering to normal forms.
Notwithstanding the value in what's already been said, it should be mentioned that normalization is a bottom-up approach, not a top-down approach. If you follow certain methodologies in your analysis of the data and in your initial design, you can be guaranteed that the design will conform to 3NF at the very least. In many cases, the design will be fully normalized.
Where you may really want to apply the concepts taught under normalization is when you are given legacy data, out of a legacy database or out of files made up of records, and the data was designed in complete ignorance of normal forms and the consequences of departing from them. In these cases you may need to discover the departures from normalization, and correct the design.
Warning: normalization is often taught with religious overtones, as if every departure from full normalization is a sin, an offense against Codd. (little pun there). Don't buy that. When you really, really learn database design, you'll not only know how to follow the rules, but also know when it's safe to break them.
As Martin Kleppmann says in his book Designing Data-Intensive Applications:
Literature on the relational model distinguishes several different normal forms, but the distinctions are of little practical interest. As a rule of thumb, if you’re duplicating values that could be stored in just one place, the schema is not normalized.
Normalization is one of the basic concepts. It means that two things do not influence each other.
In databases specifically, it means that two (or more) tables do not contain the same data, i.e. do not have any redundancy.
At first sight that is really good, because your chances of running into synchronization problems are close to zero, you always know where your data is, and so on. But as your number of tables grows, you may have problems cross-referencing the data and getting summary results.
So in the end you will finish with a database design that is not purely normalized, but has some redundancy (it will sit at one of the possible levels of normalization).