"Behind the Scenes"
July 2011 | The monthly newsletter by Felgall Pty Ltd
Form Validation and Security
A lot of people get confused about how security works with forms in web pages. There are many different aspects to this security and a lot of people get one mixed up with another. Here I'll go through all of the different aspects of form security, what they are for and how they work. Even if you don't create your own web forms you may find the following useful when it comes to filling out forms on other people's sites.
The best way to ensure that any forms in your web page do not create security holes is to validate all the form fields properly. If your form fields only allow values that actually make sense then the application the form passes that data to will be far more secure than if you attempt to apply generic security solutions (which are often misused and therefore don't provide the expected security anyway). Validate form input properly and the security measures relating to those form fields will be taken care of automatically.
Even without considering security you need to validate form field inputs anyway if you want the application that uses that input to be able to process it correctly. If a field is supposed to collect an email address then someone entering the address of their post office box into the field isn't going to allow emails to be sent.
There are a number of considerations when it comes to validating form fields, not all of which provide security. We also need to consider the purpose of each alternative so that each is used in the most appropriate way and so that, where there is a choice, we pick the most appropriate option.
The next aspect to consider is how the data in the form is to be sent to the server. There are two methods that you can specify on a form to say how the form data is to be handled - GET and POST. Which of these you choose should be based on the purpose for which each is provided, since this choice has nothing whatever to do with how securely the data is transmitted.
Using the GET method tells the server that the intention is to retrieve information from the server. The information to be retrieved is presumably relatively static, so the browser is allowed to cache the results and redisplay them if the same request is made again, without having to call the server.
The POST method implies that you are updating something on the server. If the same form data is posted again it therefore does need to be passed to the server, since the action performed the second time may not be the same as it was the first time and the results may also be different.
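As a rough illustration of the difference, here is a minimal Python sketch that builds (but does not send) one request of each kind. The endpoint URL and field names are hypothetical and exist only for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and endpoint, used only for illustration.
fields = {"q": "css 3", "page": "2"}
endpoint = "https://example.com/search"

# GET: the form data becomes part of the URL, so the same URL can be
# cached, bookmarked and requested again without side effects.
get_request = Request(endpoint + "?" + urlencode(fields), method="GET")

# POST: the form data travels in the request body and is sent to the
# server again every time the form is submitted.
post_request = Request(endpoint, data=urlencode(fields).encode("ascii"), method="POST")
```

Neither request is any more private than the other on its own; the only difference shown here is where the data is placed.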
What does determine how secure the data is as it passes from your form to the server is whether the form is on a web page using HTTP or on one using HTTPS. In the former case the form content is passed to the server in plain text. If someone intercepts the transmission (what is known as a "man in the middle" attack) then they will be able to read and use the data that is being passed. The only way of securing the data against a man in the middle is to use HTTPS, where the server's security certificate lets the browser confirm that it is talking to the right server and set up an encrypted connection, so that the form data is encrypted before it is sent and can only be decrypted by the server.
Now that we've considered the overall aspects of validation and security for our form, it's time to think about how we are going to validate each field. There are a number of different approaches that we can take, all of which can achieve the same end result. We can write code that breaks the field content up into its component parts and tests each part for specific values. We can examine the field character by character to see whether each character is allowed at that spot. We can use a regular expression to see if the content matches a particular pattern. Depending on the language and what the field can contain, we may also have a simple function call available that can do the validation for us. While each of these approaches produces the same result there are specific reasons for choosing one approach over the others - the less code you need to write yourself and the more processing you can hand off to functionality built into the language, the less error prone your code will be and the less likely it is to contain security holes. In other words "don't reinvent the wheel".
Where the language you are using for your validation provides a single function call that can validate the field, that is the best choice for validating it. So if you are validating a field that must be numeric using PHP then testing is_numeric() reduces your validation to a single statement and stops anything that isn't a number getting through. Be aware that is_numeric() also accepts decimals and exponent notation such as 1e5, so where only whole numbers make sense a stricter test such as ctype_digit() or filter_var() with FILTER_VALIDATE_INT is the better fit. Of course if the number needs to be within a specific range of values you'd need to test for that too, but having already eliminated everything that isn't a number means that the range testing will also be much simpler.
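The same idea can be sketched in Python, which is used here only as an illustration (the article's own examples are PHP). The function name parse_quantity and the 1 to 99 range are assumptions made up for the example.

```python
def parse_quantity(raw, minimum=1, maximum=99):
    """Return the integer if raw is a whole number within range, else None."""
    text = raw.strip()  # tolerate accidental spaces around the value
    # isascii() plus isdecimal() accepts only the digits 0-9, so signs,
    # decimal points, exponents and letters are rejected before conversion.
    if not (text.isascii() and text.isdecimal()):
        return None
    value = int(text)
    # With everything non-numeric already eliminated, the range test is trivial.
    return value if minimum <= value <= maximum else None
```

Because the built-in string tests do the hard part, there is very little hand-written logic left in which a mistake could hide.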
Of course not everything has its own validation function built right into the language and so most fields will require a bit more than a simple function call. PHP does have many of the main field types covered though with the filter_var() function, where the second parameter specifies which of a number of common field types to check (for example filter_var($email, FILTER_VALIDATE_EMAIL) tests whether $email contains a valid email address, returning the address if it does and false if it doesn't).
At this point it is also worth considering what the difference is between validating a field and sanitizing a field. Validating means testing if the field is valid or not whereas sanitizing simply strips out any characters that are invalid for the particular field type. Even if you sanitize a field it doesn't necessarily mean that what you have left is valid.
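To make the distinction concrete, here is a small Python sketch. The four digit postcode rule and both function names are hypothetical, chosen only to show that a sanitized value can still be invalid.

```python
import re

def sanitize_digits(raw):
    """Sanitize: strip every character that is not one of the digits 0-9."""
    return re.sub(r"[^0-9]", "", raw)

def is_valid_postcode(value):
    """Validate: accept only a value made up of exactly four digits."""
    return re.fullmatch(r"[0-9]{4}", value) is not None

cleaned = sanitize_digits("12ab")          # only the digits survive
still_invalid = not is_valid_postcode(cleaned)  # too short to be a postcode
```

Sanitizing removed the letters, but what is left is too short to pass validation, so the two steps cannot stand in for one another.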
Where a single regular expression is insufficient to validate the field (or where a single regular expression would be hugely complex) you might consider using two or three regular expressions in combination. You would really struggle to find a field that requires anything more complex than that to do the validation.
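As an example of combining patterns, the following Python sketch checks an email-like value with three short regular expressions instead of one enormous one. The patterns are deliberately simplified assumptions for illustration and are stricter than the real email rules allow.

```python
import re

# Simplified patterns for illustration only, not a complete email grammar.
OVERALL = re.compile(r"[^@\s]+@[^@\s]+")  # exactly one @ and no whitespace
LOCAL = re.compile(r"[A-Za-z0-9_+-]+(\.[A-Za-z0-9_+-]+)*")  # words joined by single dots
DOMAIN = re.compile(r"([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}")

def looks_like_email(value):
    """Return True only if all three patterns accept their part of the value."""
    if OVERALL.fullmatch(value) is None:
        return False
    local, domain = value.split("@")  # safe: OVERALL guarantees a single @
    return LOCAL.fullmatch(local) is not None and DOMAIN.fullmatch(domain) is not None
```

Each pattern stays short enough to read and test on its own, which is the point of splitting the check up.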
There are only three excuses for doing the validation any other way: the language you are using doesn't support regular expressions; you are a very new coder who hasn't learnt about regular expressions yet and your fields are so unusual that you can't locate an existing regular expression that will validate them; or you need to provide more specific advice as to exactly what is wrong with the input value. There's not much you can do other than write the validation out the long way if your language doesn't support regular expressions, but in the other two cases you still don't really have an excuse not to use one. If you don't know how to write a regular expression to validate your field and you can't find the code to do it anywhere then go to an appropriate forum and ask for help in writing it. If you want to provide more information on what specifically is invalid about a value then use a regular expression to validate it first, and only if it is invalid proceed to the more long-winded way of checking the content to identify exactly what's wrong. That way you know that the field is invalid before you spend the time analysing it. This both makes your code more efficient when the field is valid and eliminates the possibility of errors in the analysis code resulting in invalid data being accepted as valid.
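The "regular expression first, detailed analysis second" approach might look something like this in Python. The username rule (3 to 16 characters, lowercase letters, digits or underscores, starting with a letter) and the function name are hypothetical.

```python
import re

# Hypothetical rule used only for this example.
USERNAME = re.compile(r"[a-z][a-z0-9_]{2,15}")

def check_username(value):
    """Return [] when the value is valid, otherwise a list of specific problems."""
    if USERNAME.fullmatch(value):
        return []  # valid: the slower analysis below never runs
    problems = []
    if not 3 <= len(value) <= 16:
        problems.append("must be 3 to 16 characters long")
    if value and not ("a" <= value[0] <= "z"):
        problems.append("must start with a lowercase letter")
    if re.search(r"[^a-z0-9_]", value):
        problems.append("may only contain lowercase letters, digits and underscores")
    # The regular expression has already rejected the value, so a gap in the
    # checks above can never cause it to be reported as valid.
    return problems or ["is not a valid username"]
```

The regular expression makes the accept or reject decision; the longer checks only explain a rejection that has already happened.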
For your server side validation you need to be as precise as you can in determining whether the content of a given field is valid. For example in validating an email address you need to validate that the field contains something which is allowed to be an email address. There isn't any way to verify whether an email address actually exists unless you send an email to that address and get the recipient to respond (e.g. by clicking on a link that only exists in that email). You need to decide, depending on the use that you are going to make of the email address, whether you need to verify the address or just validate that the content is allowed to be an email address. There's no reason though why you wouldn't do server side validation that eliminates everything that obviously isn't an email address.
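The verification step described above can be sketched roughly as follows in Python. The in-memory dictionary stands in for a database table, the function names are made up for the example, and actually sending the email containing the link is left out.

```python
import secrets

# In-memory stand-in for a table of pending verifications (illustration only).
pending = {}

def start_verification(email):
    """Create a one-time token to embed in the link emailed to the address."""
    token = secrets.token_urlsafe(32)  # unguessable, safe to put in a URL
    pending[token] = email
    return token

def confirm(token):
    """Return the verified address and use up the token, or None if unknown."""
    return pending.pop(token, None)
```

Only someone who can read mail sent to the address will ever see the token, so a successful confirm() shows that the address exists and is in use.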
Performing the strictest validation that is reasonable on the server automatically improves the security of your form because it eliminates the possibility of something harmful being input into most of the fields on your form. It is impossible to inject HTML or SQL into a field that is validated as a number, and very difficult with a strictly validated email address, because the code being injected relies on characters (such as angle brackets, spaces and semicolons) that those field types reject.
Once you have all of the proper validations set up for all your fields, the only fields that will accept HTML or SQL are those where you expect that type of input. That means that those fields need to be able to handle HTML or SQL all the time and not just when someone is trying to bypass your security. Since outputting user entered HTML into a web page as HTML creates an automatic security hole in your processing, you'd obviously never want to do that without some very thorough validation to restrict which tags can be used. If instead the HTML is to be displayed as text you'd use htmlspecialchars(), or the equivalent for whichever server side language you are using, to convert the HTML into plain text immediately before outputting it to the page. Since converting those special characters to entity codes is something you need to do for all the valid input in that field, its use there has nothing to do with security.
Using such a call when outputting other fields that are not expected to contain HTML does prevent any successfully injected HTML from being processed as HTML. Such content is still invalid for those fields, though, and if it gets far enough into your code for that call to make a difference then your validation has already been shown to be inadequate and your data has been compromised. If your barn has no walls then locking the door does prevent the horse getting out through the door but doesn't prevent it getting out. Using htmlspecialchars() on an email address is as effective at preventing code injection as locking that barn door.
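For a field that legitimately contains HTML-like text, the output step can be sketched in Python, where html.escape() from the standard library plays the role that htmlspecialchars() plays in PHP. The render_comment function and the surrounding paragraph tag are assumptions for the example.

```python
from html import escape

def render_comment(comment):
    """Wrap user entered text in a paragraph, displaying any HTML in it as text."""
    # Convert the special characters to entity codes immediately before output.
    return "<p>" + escape(comment) + "</p>"

page_fragment = render_comment("<b>hi</b> & bye")
```

The conversion is applied to every value of the field, valid or not, because it is what makes the text display correctly.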
In fact any form of "escaping" data is there simply to prevent the data from being confused with the commands that are acting on that data. Often there are alternative ways to write your code so that the possibility of the code and data being confused can be avoided, and escaping the data then becomes unnecessary. For example if you use prepared statements for your database calls then you can use placeholders for almost all of the places where you want to be able to insert data into the call. About the only places where you can't use a placeholder are identifiers such as table and column names (and if you're letting your visitors enter the name of the table to be accessed in your database through a form then you really need to think carefully about how you are going to handle the security of your entire database). By eliminating the possibility of the code and data getting confused you do away completely with the need to escape the data being input into the database.
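Here is a minimal sketch of placeholders using Python's built-in sqlite3 module with an in-memory database. The table names, column names and the find_by_email function are hypothetical, and the fixed list of allowed tables is just one way of handling the one part that a placeholder can't cover.

```python
import sqlite3

ALLOWED_TABLES = {"members", "guests"}  # hypothetical table names

def find_by_email(conn, table, email):
    """Look up names by email; the value is bound, the table name is checked."""
    # Identifiers cannot be bound as parameters, so the table name must
    # come from a fixed list rather than straight from the form.
    if table not in ALLOWED_TABLES:
        raise ValueError("unknown table")
    # The ? placeholder keeps the data separate from the SQL text.
    cursor = conn.execute(f"SELECT name FROM {table} WHERE email = ?", (email,))
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (name TEXT, email TEXT)")
conn.execute("INSERT INTO members VALUES (?, ?)", ("Pat", "o'brien@example.com"))
```

Note that the apostrophe in the stored address needs no escaping at all, because the value never becomes part of the SQL statement.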
By validating the fields from your form properly when they are first passed to the server you can greatly simplify all of the subsequent code, since once you have eliminated all of the invalid values you no longer need to worry about them interfering with any of the processing that you perform on those fields. If invalid values can't get into your system in the first place then you not only avoid potential security holes, you also avoid processing meaningless garbage. While only a small number of people deliberately try to insert malicious code into forms so as to bypass security, far more people may accidentally type garbage values into your form, and so using the most appropriate validation will have its greatest impact in the area of data integrity.
As you can see, I have started publishing a new series of references on CSS 3 this month. I do have a number of further articles prepared for this series to publish in coming months and will be assembling even more. I have also been working on designing the second phase of my club membership site application (the first phase created a site for a specific club). This second phase will start the process of extracting all the club specific parts so that the same script will be able to be used for other clubs.
The following links will take you to all of the various pages that have been added to the site or undergone major changes in the last month.