sqldatabasedatabase-designdatabase-normalizationrelational

What is database normalisation/normalization?



Solution

  • Normalization is basically to design a database schema such that duplicate and redundant data is avoided. If the same information is repeated in multiple places in the database, there is the risk that it is updated in one place but not the other, leading to data corruption.

    There is a number of normalization levels from 1. normal form through 5. normal form. Each normal form describes how to get rid of some specific problem.

    First normal form (1NF) is special because it is not about redundancy. 1NF disallows nested tables, more specifically columns which allows tables as values. Nested tables are not supported by SQL in the first place, so most normal relational databases will be in 1NF by default. So we can ignore 1NF for the rest of the discussions.

    The normal forms 2NF to 5NF all concerns scenarios where the same information is represented multiple times in the same table.

    For example consider a database of moons and planets:

    Moon(PK) | Planet  | Planet kind
    ------------------------------
    Phobos   | Mars    | Rock
    Daimos   | Mars    | Rock
    Io       | Jupiter | Gas
    Europa   | Jupiter | Gas
    Ganymede | Jupiter | Gas
    

    The redundancy is obvious: The fact that Jupiter is a gas planet is repeated three times, one for each moon. This is a waste of space, but much more seriously this schema makes inconsistent information possible:

    Moon(PK) | Planet  | Planet kind
    ------------------------------
    Phobos   | Mars    | Rock
    Deimos   | Mars    | Rock
    Io       | Jupiter | Gas
    Europa   | Jupiter | Rock <-- Oh no!
    Ganymede | Jupiter | Gas
    

    A query can now give inconsistent results which can have disastrous consequences.

    (Of course a database cannot protect against wrong information being entered. But it can protect against inconsistent information, which is just as serious a problem.)

    The normalized design would split the table into two tables:

    Moon(PK) | Planet(FK)     Planet(PK) | Planet kind
    ---------------------     ------------------------
    Phobos   | Mars           Mars       | Rock
    Deimos   | Mars           Jupiter    | Gas
    Io       | Jupiter 
    Europa   | Jupiter 
    Ganymede | Jupiter 
    

    Now no fact is repeated multiple times, so there is no possibility of inconsistent data. (It may look like there still is some repetition since the planet names are repeated, but repeating primary key values as foreign keys does not violate normalization since it does not introduce a risk of inconsistent data.)

    Rule of thumb If the same information can be represented with fewer individual cell values, not counting foreign keys, then the table should be normalized by splitting it into more tables. For example the first table has 12 individual values, while the two tables only have 9 individual (non-FK) values. This means we eliminate 3 redundant values.

    We know the same information is still there, since we can write a join query which return the same data as the original un-normalized table.

    How do I avoid such problems? Normalization problems are easily avoided by giving a bit of though to the conceptual model, for example by drawing an entity-relationship diagram. Planets and moons have a one-to-many relationship which means they should be represented in two different tables with a foreign key-association. Normalization issues happen when multiple entities with a one-to-many or many-to-many relationship are represented in the same table row.

    Is normalization it important? Yes, it is very important. By having a database with normalization errors, you open the risk of getting invalid or corrupt data into the database. Since data "lives forever" it is very hard to get rid of corrupt data when first it has entered the database.

    But I don't really think it is important to distinguish between the different normal forms from 2NF to 5NF. It is typically pretty obvious when a schema contains redundancies - whether it is 3NF or 5NF which is violated is less important as long as the problem is fixed.

    (There are also some additional normal forms like DKNF and 6NF which are only relevant for special purpose systems like data-warehouses.)

    Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse. It makes it sound like normalization is a complicated mathematical process. However, normalization is basically just the common sense, and you will find that if you design a database schema using common sense it will typically be fully normalized.

    There are a number of misconceptions around normalization:

    Does it apply to anything outside of databases? Not directly, no. The principles of normalization is quite specific for relational databases. However the general underlying theme - that you shouldn't have duplicate data if the different instances can get out of sync - can be applied broadly. This is basically the DRY principle.