Database Normalization

Proper design of a relational database requires that the database first be "normalized". This involves a series of steps that will result in our obtaining a list of entities and relationships between those entities that most effectively represents our data. While in some cases efficiency concerns may cause us to undo some normalizations (or not apply them in the first place), in most instances the normalized form of our entity relationship diagram will be directly implemented in the physical relational database. Where a non-relational database is to be used (eg. a hierarchical database) then further changes will be required in converting our logical design into the physical database but producing a normalized logical design first is still useful.

There are a number of different notations that can be used to create the entity - relationship diagram for our logical database design. Most involve using boxes for the entities (tables) with a line between them indicating a relationship. Notations on or next to the line identify the relationship. One such notation uses the following to indicate that one occurrence of entity B may have zero or one occurrence of entity A (ie. entity A contains optional data associated with entity B):

zero or one to one relationship

Most entities would be related via a one to many relationship. In the following entity C may have zero or more occurrences of entity D associated with it (ie. entity D is optional) and entity E may have one or more occurrences of entity F associated with it (ie. entity F is mandatory).

one to many optional relationship
one to many mandatory relationship

Other relationships between entities can be identified by combining the notations shown in these examples but these are the ones most commonly required particularly once your database is normalized. So now we have agreed on the notation I will use for this discussion let's move on to normalization.

There are two alternate lists of what constitutes 1st, 2nd, and 3rd normalizations but both give the same result for a database in third normal form.

Version One

First Normal Form

Define a primary key field that uniquely identifies each row of the table and identify foreign keys that define the dependencies that exist to other tables.

Second Normal Form

Remove all repeating groups into a separate table (this removes many to many relationships).

Third Normal Form

Remove non-dependent information (ie. information not dependent on the primary key) into a separate table.

Version Two

First Normal Form

Remove repeating groups into a separate table.

Second Normal Form

Remove non-key columns only dependent on part of the key into a separate table

Third Normal Form

Remove non-key columns representing facts about other non-key columns into a separate table

example database in third normal form

You have a database in third normal form when every column in the table represents an unique fact relating to the key, the whole key, and nothing but the key. The above diagram is an example of a database in third normal form (assuming that we have placed each data field in the appropriate entity). Let us consider what this diagram tells us. A supplier produces zero or more products. A warehouse can store zero or more products. Customers can make multiple purchases of products and each purchase is invoiced. Each invoice itemizes the products purchased on a separate line. Products may be purchased multiple times. We can add Suppliers and Warehouses before we add the products that they produce/store and we can add customers before we invoice them. We can also add products before anyone purchases them. We do not produce an invoice unless there is at least one line item identifying a product being purchased. All of the information is stored in the appropriate table so that common data is not repeated between different rows in the same table but is instead referenced from another table using the foreign key.

Third normal form is often as far as you need go with normalizing a database design but three further normalizations exist that apply in some circumstances.

Boyce-Codd Normal Form

Where your database design contains a circular relationship you should redefine the keys so as to remove the circular reference. This situation is detectable when your database design diagram shows two different relationship paths that can be followed to get from one entity to another.

Fourth Normal Form

Where two or more independent multi-value facts occur for an entity then these facts should be split into separate tables. This relates to when you are extracting repeating groups into a separate table. When the fields in the repeating group are not dependent upon one another they should be split into separate tables eg. if your repeating group contains a list of hobbies and a list of jobs then the hobby field is not related to the job field and so you don't create one hobby/job table you create separate hobby and job tables.

Fifth Normal Form

Interdependent columns (columns dependent on an entire composite key) should not be split into separate tables. This needs to be looked at when you have an entity that is on the multiple side of three or more relationships. Unless all of the data that is dependent on each of the related tables is also dependent on every other table then this single entity should be replaced by two or more separate entities handling the relationship between pairs of related tables.

 

This article written by Stephen Chapman, Felgall Pty Ltd.

go to top

FaceBook Follow
Twitter Follow
Donate