A very large number of people writing server-side code for their web sites appear to have no idea how many of the built-in functions provided should be used. They end up applying escaping functions to their input and may never consider using filtering functions of any sort.
One of the reasons for this is that the books that teach these languages generally keep their code samples small. They omit all of the filtering and escaping that would normally be added in a live environment so as to show only the small pieces of code that perform the processing the program exists for. Most of these books do state that you will need to add a lot of extra code to perform filtering and escaping in your real-world programs, but most do not give examples of how to do this. Those that do generally demonstrate both at once in a small example, which obscures which parts of the processing apply to input and which to output.
There are two different types of filtering that you can do on your input - you can sanitise it or you can validate it. Sanitising means that you compare the content provided to a pattern that valid data for that field is expected to match and strip out any characters that are not valid according to that pattern. Whenever you read data from a database or file or anywhere else that you don't normally expect the data to be compromised but where there is at least a potential that it could be you should sanitise the data. If, as you would normally expect, the data has not been tampered with then it will pass through the sanitisation step unchanged. If the data has been corrupted in some way then the sanitisation step will remove any characters that could be harmful to any of the subsequent processing. Note that unsanitised and corrupted data can cause just as much harm within the program itself and need not get as far as being output in order to create issues - that's why the data has to be sanitised first before it is used anywhere in the code.
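As a minimal sketch of sanitising, the following Python snippet strips every character that does not fit the pattern for a field. The field (a username limited to ASCII letters, digits and underscores) and the name `sanitise_username` are assumptions made for illustration, not rules from any particular system.

```python
import re

# Hypothetical rule: a username may contain only ASCII letters, digits and underscores.
_INVALID_USERNAME_CHARS = re.compile(r"[^A-Za-z0-9_]")


def sanitise_username(raw: str) -> str:
    """Return raw with every character outside the allowed set removed."""
    return _INVALID_USERNAME_CHARS.sub("", raw)


# Untampered data passes through unchanged.
clean = sanitise_username("alice_01")

# Corrupted data loses only the characters that are invalid for the field.
repaired = sanitise_username("al<ice>;--")
```

Data that already matches the pattern comes back unchanged, so the step costs nothing when the stored value is intact.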
The only time you don't need to filter your input by sanitising it is where you filter it by validating it instead. Where sanitising simply strips out anything that is invalid, validating rejects the entire content of the field if that content is not considered valid for that particular field. Validation is usually used on inputs supplied by the person running the code: where they enter something incorrectly, the validation rejects that field and gives them an error message asking them to fix it.
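By contrast, a validation step might look like the sketch below. It rejects the whole field and returns an error message rather than stripping characters. The age range and the wording of the message are assumptions chosen for the example.

```python
import re

# Hypothetical rule: an age is one to three ASCII digits and no more than 150.
_AGE_PATTERN = re.compile(r"[0-9]{1,3}")


def validate_age(raw: str) -> tuple[bool, str]:
    """Return (True, "") if raw is a valid age, else (False, error message)."""
    if _AGE_PATTERN.fullmatch(raw) and int(raw) <= 150:
        return True, ""
    return False, "Please enter your age as a whole number between 0 and 150."


ok, message = validate_age("42")         # accepted, no message
bad, bad_message = validate_age("4<2")   # rejected outright, nothing is stripped
```

Note that the invalid value is never repaired: the person who entered it is asked to correct it.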
With all of the fields having been filtered either by sanitising or validating them you know that the content of each and every field that the code is processing is valid input for that field. It may or may not be actually meaningful but it is at least valid. Valid data should not be able to do any harm in any of the subsequent processing because the subsequent processing should be written in such a way as to be able to handle the data provided that it is valid.
Unfortunately, data can be valid and still have the potential to cause problems in subsequent processing, so that a further step is required. This situation arises where the code is outputting something in which certain content has special meanings. With web pages the two characters that this applies to are < and &, which indicate the start of a tag and the start of an entity code respectively. Where the data for a given field is allowed to contain either of these characters with its regular meaning and we are outputting that data as HTML, we need to make sure that those characters are not misinterpreted as having the special meanings that HTML applies to them. We do this by escaping them in a way that HTML recognises as giving them their regular meanings. As text within HTML, < characters need to be escaped as &lt; and & characters as &amp;, which are the HTML entity codes for the two characters. The way in which data needs to be escaped depends on the output it is to form a part of, so there is no point in escaping before the data is actually output as HTML. Because entity codes are specific to HTML and XHTML, performing the conversion any earlier would destroy the meaning of the data, as it could then only be used as HTML.
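The point about escaping only at output time can be sketched in Python with the standard library's `html.escape`. The function name `render_comment` and the sample text are hypothetical. The stored value stays in its plain form and is converted only at the moment it is placed into HTML.

```python
import html


def render_comment(comment: str) -> str:
    """Escape the data at the point where it is placed into HTML."""
    return "<p>" + html.escape(comment) + "</p>"


# The stored value keeps its regular meaning and can still be used elsewhere.
stored = "Fish & chips < caviar"

as_html = render_comment(stored)   # "<p>Fish &amp; chips &lt; caviar</p>"
as_text = stored                   # plain-text output needs no HTML escaping
```

Because `stored` is never altered, the same value can later be written to a plain-text email or a CSV file with whatever escaping that format requires.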
You need to escape your data anywhere that it is being included within content where any of that content can have special meanings. You do not need to escape it where nothing it is combined with has any special meaning, since escaping is specifically about marking where special meanings are to be applied and where those same characters are to be treated as data. Where there is a way to keep the data separate, so that it is clearly identified as data with no special meanings, that approach should be used instead of escaping. Keeping the data completely separate means that there is no possibility whatever of special meanings being inappropriately applied, whereas mixing the data with other content and escaping it means that you are relying on having properly escaped everything necessary. The escape functions for data being injected into database commands should prevent the data being misinterpreted as part of the command, but sending the database command and the data it is to use separately (as prepared statements with bound parameters do) means that a bug in the escape routine does not produce any vulnerability in your code.
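A minimal sketch of keeping the command and the data separate, using Python's built-in `sqlite3` module with an in-memory database: the table, the helper name `find_users_by_name` and the sample values are all hypothetical. The `?` placeholder marks where the data goes, and the value is passed alongside the command rather than pasted into it.

```python
import sqlite3


def find_users_by_name(conn: sqlite3.Connection, name: str) -> list[str]:
    """Look up users by name; the value is bound, never spliced into the SQL."""
    cursor = conn.execute("SELECT name FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# A value that would change the query's meaning if it were concatenated into it.
hostile = "x' OR '1'='1"
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))

# The hostile string is treated purely as data and matches only itself.
matches = find_users_by_name(conn, hostile)
```

No escaping function is called anywhere in this sketch, so there is no escape routine whose bugs could matter.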
There will also be many fields that do not need to be escaped at all, because the input filtering will already have removed every character that could need escaping as invalid for that field. For example, there should never be a need to escape someone's name in order to output it in HTML, as no one's name should ever contain either of the two characters that have special meanings. This applies even more obviously to things like age (which would always be numeric) or date of birth (which, even if you accept it in any format, would need to be converted to a standard format by the input filtering).
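The date of birth case might be sketched as follows. The list of accepted formats and the choice of ISO 8601 as the standard format are assumptions for illustration. Whatever the filter returns can contain only digits and hyphens, so nothing in it needs escaping for HTML.

```python
from datetime import datetime
from typing import Optional

# Hypothetical set of input formats the form is willing to accept.
_ACCEPTED_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalise_date_of_birth(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if it matches no accepted format."""
    for fmt in _ACCEPTED_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next accepted format
    return None


standard = normalise_date_of_birth("14/03/1990")
rejected = normalise_date_of_birth("<b>1990</b>")
```

Here the first call yields the date in standard form and the second yields `None`, which the calling code would treat as a validation failure.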
Perhaps part of the problem people have in correctly filtering input and escaping output is that the language they are using doesn't identify the functions provided for these purposes in a way that makes it obvious which are which. For example, in PHP the strip_tags() function is a filtering function that you can use to sanitise your input. It removes any HTML tags from the data, and so is an appropriate filter where you want the field to be able to contain any plain text but no HTML markup. The htmlspecialchars() function is an escaping function, suitable for use when outputting data to HTML that can contain characters with special meanings. To play safe, it converts more than the two characters I mentioned earlier: it also converts > and double quotes, and single quotes depending on the flags in effect. It would be quite appropriate to use both of these functions on the same data where the < character is allowed as part of the text but is specifically not allowed to be the start of an HTML tag.
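The two-step combination can be sketched in Python, used here only for illustration. `strip_markup` is a hypothetical and deliberately simplified stand-in for a tag-stripping filter (it is not equivalent to PHP's strip_tags()), and the standard library's `html.escape` plays the role of the escaping function.

```python
import html
import re

# Simplified assumption: a tag is "<" or "</" followed by a letter, up to the next ">".
_TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def strip_markup(text: str) -> str:
    """Input filter: remove anything that looks like an HTML tag."""
    return _TAG_PATTERN.sub("", text)


def render_plain_text(text: str) -> str:
    """Output step: escape whatever special characters legitimately remain."""
    return html.escape(text)


submitted = "I <b>think</b> 3 < 5 & 5 > 3"

filtered = strip_markup(submitted)      # done once, when the input arrives
rendered = render_plain_text(filtered)  # done only when writing HTML
```

The filter runs on the way in and the escape runs on the way out. The `<` that is ordinary text survives the filter and is then escaped, while the tags are removed before the data is stored or used.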
Performing these processes in the right place and applying the appropriate filter for what the field is allowed to contain and the appropriate escape for where the data is being used will prevent problems where inappropriate data is accidentally (or deliberately) entered into a field.
This article was written by Stephen Chapman, Felgall Pty Ltd.