Gaps in Pseudokeys

If a table in a database either has no natural key or the natural key is so long that using it as the key would be inefficient then we need to consider using a pseudokey instead. Pseudokeys are usually given the same name as the table they belong to with _id added to the end of the name. They usually are defined to contain a number and use autoincrement in mySQL or the equivalent in other versions of SQL to dynamically allocate a value when new rows are created.

As soon as you start deleting rows from a table that uses a pseudokey you end up with gaps in the sequence. If the field is genuinely a pseudokey then these gaps don't matter. The values are just arbitrary numbers to uniquely identify each row in the table and what value relates to rows other than the one you are interested in is irrelevant. These keys are NOT row numbers and so the fact that you eventually end up with lots of gaps in the series makes no difference whatsoever. If you need row numbers then you can dynamically generate them when you extract the particular set of rows you are interested in.

If you find a situation where the gaps matter then either you are mistaken or the value is not a pseudokey but is actually a natural key instead.

One way some people are mistaken is where they are creating lots of new rows which don't last very long and they think that they will run out of numbers for the pseudokey if they don't reorganise the data to fill in the gaps. Well this might in fact be true if they are generating a million new rows a second continuously for 584542 years. If they haven't updated their computer system during that time then they will in fact run out of numbers at that point. If at some stage prior to that computers that support 128 bit processing are introduced then simply upgrading the system will mean that the available range of numbers will become much larger. Further upgrades to the computer system to support even larger numbers at some point during those billions of trillions of years will mean that effectively computers can never run fast enough to use up all of the numbers available to them.

If the key is used outside of the database then it is not a pseudokey, in that situation it is a natural key even if you are generating it within the database. This makes reusing prior values more likely but only where the new value is allocated first outside of the database and is then applied within the database - meaning that you'd set the specific value when creating the row rather than allowing the database to generate a value for it. This should only ever be done if you are certain that every single reference to the prior information identified by that natural key no longer exists. Just how long you need to allow to elapse prior to being able to be certain of this will vary depending on just what the data relates to but for most data a decade or two since the prior entry was deleted might be sufficient. Of course to be absolutely certain that no references to that natural key anywhere else exist you would still be safest to simply let the system generate a new one for you that is known to have never been used before.

The best solution to the gap 'problem' with pseudokeys is to only use pseudokeys where there is no suitable natural key. Where there is no natural key at all (for example with threads and posts in a forum) the pseudokey will effectively become the natural key since those values are the only way to reference the particular post or thread in a way that will identify that one and no other. In other cases the natural key may consist of several fields each of which is relatively long and so the total length of the natural key makes it unsuited to use in the database and so a pseudo key is substituted. For any other circumstance the simplest fix for gaps in the pseudokeys is to delete the pseudokey completely and use the natural key instead.

Gaps in pseudokeys are completely irrelevant since the only purpose in having the key is to provide a unique identifier for the row. Any attempts to fill in the gaps in pseudokeys will at best make the application far less efficient and at worst will completely corrupt the data. There is never a legitimate reason for changing pseudokey values once set.

Natural keys are a completely different matter and there may be rare situations where changing a natural key is necessary. These situations should be relatively rare as otherwise the field(s) being used for the key is probably not the right one to be using for the natural key to that row of data. Where such changes do occur it would be made as a one off update of all of the references to that key inside the database and a corresponding update of everywhere outside the database where that key is also being used.

Gaps in natural keys may be relevant depending on how the natural keys are assigned. One example where they would be relevant would be with sequentially generated receipt numbers where you need to maintain an audit trail of who each receipt was issued to. In this situation a cancelled receipt should still have a corresponding row with something set within that row to identify it as cancelled. A missing row would indicate a problem with the data.

It is the confusion between natural keys and pseudokeys and the confusion between unique row identifiers and row numbers that lead some people to believe that it is important to renumber the pseudokeys to fill in the gaps. Doing so however will just create problems. Explaining this to people without a technical background might be difficult and so simply avoiding the use of pseudokeys as much as possible so that the situation is far less likely to arise in the first place is the obvious solution.


This article written by Stephen Chapman, Felgall Pty Ltd.

go to top

FaceBook Follow
Twitter Follow