One Field With Multiple Values

One mistake relating to database design that is common enough to have several chapters about how to fix it in the book "SQL Antipatterns" is the handling of fields that are allowed to have more than one value at the same time. It appears that there are at least a couple of different ways that people try to handle this situation.

One way people try to handle this is using a single field to store a comma separated list of the values. Using this method means that as long as the field is big enough to hold all the values there is no limit on the number of values that the field can have. The problem with this way of storing the data though is that you make it just about impossible to reference soecific values in the field in your database queries with any query ending up very complex and inefficient.

Another way that people try to handle this problem is to create multiple instance of the field in the one record of the database. This too is not very efficient since you will have numm values where the record has fewer values than you have allowed fields, no way to handle when there are more values than the number of allowed fields, and the database queries while not as complicated as with a comma separated list will still need to test each and every one of the fields for each value you are looking for.

The thing is that the only time that the first of these options is appropriate is when you will never need to do any queries on the field and where the use that is to be made of the field once it is retrieved from the database requires a comma separated list. In that instance you can then look on the comma separated list as the one value that the field contains.

The second of these is more useful in that it is a reasonable solution where each of the fields actually has a separate independent meaning. You may have three or four separate name fields in your record but each name is actually used for something completely separate such as a 'created by' name, 'last modified by' name, 'requested by' name, and perhaps 'completed by' name. Here too while it may look at first glance as if we have a fiend that can contain up to four values it is actually four separate fields since swapping the values between the fields changes their meaning.

So in both cases the only time you ought to use that particular construct is where you only really have one value per field after all, just in the first case that one value just happens to be a comma separated list and in the second case there are a number of similar fields which can at first glance be misinterpreted as one field with multiple values.

Of course no one who knows even the first thing about database design will ever make either of those mistakes. The only time such a situation will occur is where the person doesn't know how to design a database properly in the first place. The very first step in designing a database after identifying all the fields it needs to contain is called normalisation and is a method of working out just which fields belong together in the one table and what fields belong in separate tables. As a minimum each database should have its logical design worked out to at least third normal form before you even start to think about whether that is an appropriate physical design.

One of the steps that you follow in getting the database into third normal form is to remove repeating groups into their own separate table having a many to one relationship back to the original table. Doing this effectively removes all of the fields that need to contain more than one value and replaces them with multiple records in a separate table where each of therecords there holds one of the values that the field needs to have.

By setting up a dependent table likwe this you can have anywhere from zero upwards without limit values for the field all attached to the one field (just as separate records in that second table). This means that there is no artificial limit imposed by the size of the field or the number of fields supplied and if one record needs 5000 values while all the others need only one or two your design can handle that record in exactly the same way it handles all the rest.

Also with storing the values in a separate dependent table you make the queires on the data much simpler since a simple join between the tables provides you with access to all of the possible values for the field using the one common field reference.

Given that following the most basic of database design steps to first normalise the data gives you the best solution for the situation where one field needs to contain multiple values, I don't see how so many people can end up using either of the usually much poorer ways of handling the situation since if they were following the appropriate steps in designing their database then it would require a deliberate decision in going from the logical to the physical design to undo that particular normalisation. Since you'd only do that when the benefits of doing so increase the efficiency of the database code and in almost all cases that change would dramatically decrease efficiency, the only possible reason for those constructs to occur so often is that the people creating them no absolutely nothing about database design.


This article written by Stephen Chapman, Felgall Pty Ltd.

go to top

FaceBook Follow
Twitter Follow