The Importance of Validation

The one thing left out of the code used in many programming courses is validation. The obvious reason for the omission is that validation makes up, on average, about 60% of the processing required, yet it doesn't affect the functionality of the code being discussed provided that valid values are used as input. Since the person writing the script controls the values they test with, a working script can be produced in a lot less code if the validation is left out. This makes it easier to teach what a particular piece of code does in a relatively short example. For the purpose of teaching how a programming language works, the validation code is not necessary and so it is usually left out.

Unfortunately, many courses do not even have a comment in the code to indicate that all the validation code has been omitted, and so many of those taking the course never realise that most of the code is missing from the examples they have been working with. They then start writing their own code the same way they saw the examples written. Usually they at least remember some mention being made of security, and so most manage to avoid implementing code that has no security whatsoever. Most, however, bolt specific code onto what they have already written to deal with specific security issues rather than considering security from the start. They don't add the missing validation but instead run what they have been told are security functions in lots of inappropriate places in their code. Their code is slightly more secure but is such a mess that there are probably still many ways in which their "security" can be bypassed, and there's certainly nothing there to prevent the bulk processing of garbage.

When it comes to security, validation is the biggest part. You should never have any inputs to your code where any value at all would be considered valid, and so by validating an input you immediately eliminate all the definitely invalid values from further consideration in your code. In most cases the range of meaningful values that a field can have will be so limited that it would be impossible to provide a value that both passes validation and provides a way to do harmful things in the subsequent processing. A valid number or date, and in most cases a valid name or address, cannot contain anything that looks even remotely like a database command or anything else that could cause problems if the subsequent processing isn't as secure as you thought. By validating all these fields to ensure that they contain something that is at least plausible content for the field, you not only remove most opportunities for subsequent security issues with the field, you also prevent your code being swamped with garbage.
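
As a rough illustration of how narrow the range of acceptable values can be, here is a minimal sketch in Python (chosen only for illustration; the same idea applies in any language). The function names, the 1 to 999 quantity range, and the YYYY-MM-DD date format are assumptions made for this example, not requirements from any particular application.

```python
import re
from datetime import date


def validate_quantity(raw: str) -> int:
    """Return raw as an int if it is a whole number from 1 to 999."""
    # [0-9] rather than \d so that only ASCII digits are accepted.
    if not re.fullmatch(r"[0-9]{1,3}", raw):
        raise ValueError("quantity must be one to three digits")
    value = int(raw)
    if value < 1:
        raise ValueError("quantity must be at least 1")
    return value


def validate_birth_date(raw: str) -> date:
    """Return raw as a date if it is a real, non-future YYYY-MM-DD date."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", raw):
        raise ValueError("date must look like YYYY-MM-DD")
    # fromisoformat raises ValueError for impossible dates such as 2023-02-30.
    parsed = date.fromisoformat(raw)
    if parsed > date.today():
        raise ValueError("date of birth cannot be in the future")
    return parsed
```

Anything that gets past either function is a plain number or a real calendar date, so a value such as "1; DROP TABLE users" is rejected long before it reaches any database code.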

The sorts of measures that most newbies introduce and consider to be "security" are only needed for fields where code of some sort is allowed to be entered in the first place. Examples of this type of field on the web can be found on forums and blogs where people can ask questions about, or comment on, various types of code. Even those inputs should not accept just anything and should validate what is entered to ensure that it is meaningful. At the very least the input shouldn't contain control characters and should be under a specific length. If you put a "Tell A Friend" script on your web site, you might allow a comment to be entered that will be sent in the email, but you wouldn't want to allow any links in the email except the one to the site they are supposed to be telling their friend about. The validation would therefore need to reject the input if any links were found in the comment.
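
The checks described for a "Tell A Friend" comment could be sketched roughly as follows. This is only an illustration in Python: the function name, the 500 character limit, and the very simple link test (anything containing "http://", "https://" or "www.") are assumptions, and a real site would need to decide for itself what counts as a link.

```python
import re
import unicodedata

MAX_COMMENT_LENGTH = 500  # assumed limit, pick whatever suits the form

# Deliberately crude link detector, used here only for illustration.
LINK_PATTERN = re.compile(r"(?:https?://|www\.)", re.IGNORECASE)


def validate_comment(raw: str) -> str:
    """Return the trimmed comment, or raise ValueError if it is not plausible."""
    comment = raw.strip()
    if not comment:
        raise ValueError("comment is empty")
    if len(comment) > MAX_COMMENT_LENGTH:
        raise ValueError("comment is too long")
    for ch in comment:
        # Category "Cc" covers control characters; line breaks are allowed.
        if unicodedata.category(ch) == "Cc" and ch not in "\r\n":
            raise ValueError("comment contains control characters")
    if LINK_PATTERN.search(comment):
        raise ValueError("links are not allowed in the comment")
    return comment
```

The value returned is still untouched text. Nothing has been escaped yet, because the comment has not yet been inserted into an email or a page.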

When data is allowed to contain code of a particular type and that data is to be inserted into that type of code, there is a potential security issue even when the data is valid. The only way to be completely secure in this situation is to keep the data and the code completely separate. It is for this reason that the database interfaces mysqli and PDO introduced separate prepare and bind statements, where the code goes in the prepare statement and the data goes in the corresponding bind statement. The data can then only be misinterpreted as code (and hence represent a security issue) if the implementation of prepare/bind allows it to occur. Since that implementation has been written and thoroughly tested by experienced programming professionals, that is extremely unlikely to occur.
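
The same separation of code and data can be shown with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for mysqli or PDO, purely so that the example runs on its own; the table, the column names, and the find_user function are made up for the illustration.

```python
import sqlite3

# In-memory database with one made-up table, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(connection, name):
    """Look up users by exact name using a parameterised query."""
    # The SQL text is fixed. The value travels separately as a parameter,
    # so it is never parsed as part of the statement.
    cursor = connection.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    )
    return cursor.fetchall()


# A classic injection attempt is treated as an (unmatched) name, not as SQL.
hostile = "alice' OR '1'='1"
```

Here find_user(conn, "alice") finds the one matching row, while find_user(conn, hostile) finds nothing, because the whole hostile string is compared against the name column as ordinary data.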

In other situations, keeping the data separate from the code is not such an easy option. You can keep the data and code completely separate when generating the HTML for a web page if you use a Document Object Model interface to construct the page. Inserting the data in a text node ensures that any HTML the data contains will be automatically converted to the appropriate entity codes, since the entire data is clearly identified as text and so cannot possibly contain code. Few people actually use this approach to generating HTML, though, and most are more likely to simply write all the HTML and data out directly. Wrapping the data portions inside htmlspecialchars() in PHP (or an equivalent in whatever other language you are using) does the entity code conversion for you, provided that you remember to actually call the function, whereas writing a text node does the same conversion automatically without any possibility of it being overlooked. These functions that convert the data into a form that allows it to be easily distinguished from the surrounding code are called escape functions.
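
To make the comparison concrete, here is a small sketch of both approaches in Python. The standard library's xml.dom.minidom stands in for a Document Object Model interface and html.escape plays the role that htmlspecialchars() plays in PHP; the sample data and variable names are invented for the example.

```python
from html import escape
from xml.dom.minidom import Document

# Made-up data that happens to contain markup.
data = "<b>bold</b> & more"

# Approach 1: build the element through a DOM and add the data as a text node.
# The conversion to entity codes happens when the node is serialised.
doc = Document()
paragraph = doc.createElement("p")
paragraph.appendChild(doc.createTextNode(data))
via_dom = paragraph.toxml()

# Approach 2: write the HTML by hand and remember to call the escape function.
via_escape = "<p>" + escape(data) + "</p>"

# What happens when the escape call is forgotten: the data becomes markup.
forgotten = "<p>" + data + "</p>"
```

In the first two results the angle brackets in the data come out as entity codes, so a browser would display them as text. In the third, the data's own tag ends up as live markup in the page.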

Instead of validating all their input and only using escape functions where they are necessary for the functioning of the code, many beginners simply use these escape functions as their attempt at providing security in their code. Not only do they not validate the data first, they often use the escape functions in completely the wrong places. Each escape function serves a specific purpose relating to outputting the data inside a specific type of code, but these beginners often just wrap all their input in two or three escape functions, converting it to meaningless junk before doing anything with it at all. The escape functions are intended to be used at the point where you insert the data into the code, and running them earlier than that means that you can no longer use the data for any other purpose.
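
A short sketch may help show why the escaping belongs at the point of output. It is written in Python with made-up values: html.escape and urllib.parse.quote stand in for whichever escape functions your own language provides, and example.com is a placeholder address.

```python
from html import escape
from urllib.parse import quote

# A validated value, kept exactly as the user entered it.
company = "O'Brien & Sons"

# Each output context gets its own escaping, applied at the moment of output.
html_fragment = "<td>" + escape(company) + "</td>"
search_url = "https://example.com/search?q=" + quote(company, safe="")
plain_text_line = "Company: " + company  # plain text needs no escaping

# The beginner's mistake: escape once on input and reuse the result everywhere.
escaped_too_early = escape(company)
wrong_plain_text_line = "Company: " + escaped_too_early
```

Because the original value is kept unchanged, it can be used correctly in the HTML, in the URL, and in plain text. The copy that was escaped on input is only correct for HTML, and shows up as entity codes wherever else it is used.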

Relying on these escape functions for security not only means that garbage will be accepted as input, it also means that every single input field that you have decided may contain anything at all must be appropriately escaped. Miss escaping just one field and the entire code is compromised. Not only can the code be run with any junk that is input, but once an attacker works out that one field is not being escaped, they can use that input field to modify the code that gets run and effectively do anything they like.

The most worrying aspect of this is that there are people actually selling code written like this, where validation is either completely missing or very basic, and where some fields may have no processing done on them at all, so that data can be entered that gets misinterpreted as code. I saw an instance of this in a helpdesk script. I had found some malware code in the script and included a copy of it in the query that I posted on the owner's helpdesk. I then received an email in HTML format with that malware embedded in it. There was no validation on the field where I was reporting the malware, and nothing was done with the value entered to identify it as data to be displayed in the email rather than as part of the email's HTML. While those who understand code properly can rewrite their purchased code to fix such things, the vast majority of buyers wouldn't be able to do so and so would be purchasing something that could lead to major problems in the future.

The biggest problem relating to this is that while validation makes up 60% of properly written code, the list of what the code does all comes from the other 40%. There is no way to tell from the description of what the code does whether it contains all the necessary validation or not. Many people now produce their own code without having learnt any aspect of programming beyond perhaps a short course on how to use a particular language. They know little or nothing about proper program design, the correct way to design databases, or how to develop a proper test plan (to name just three of the many aspects of professional program development that wouldn't be covered in a beginner's programming course). These people do not even realise that they know nothing at all about the most important aspects of creating proper code. They can sell their code for a small fraction of what a professional would need to charge because they only write 40% of the code and leave out 95% or more of the testing. Unfortunately, there is no easy way for the buyer to realise this.


This article written by Stephen Chapman, Felgall Pty Ltd.
