Tainted Data and Validating/Sanitising

Many programmers seem to know a great deal about every aspect of programming EXCEPT security. They can write brilliant code, but it is almost impossible to maintain because it omits even the most basic security features.

To start with, let's consider the concept of tainted and untainted data. The first thing to note about this most basic security technique (covered early in any decent programming security book of the past couple of decades, for example on page 10 of Chris Shiflett's book 'Essential PHP Security') is that it is different from valid/invalid data.

Data is tainted or untainted not on the basis of its value but on the naming convention used for the variable NAMES. For example, in PHP all variables that start with $_ are tainted (so are $env and usually $row, if that's the name you give to the array you retrieve into from the database). All other variables can be untainted provided you never assign the value of a tainted variable to an otherwise untainted one without validating or sanitising the content first.

Worse than meaningless assignments like:

$email = $_POST['email'];

are so widespread that they effectively taint all the other variable names in your code unless you apply a strict policy of NEVER making such assignments. In this case the assignment saves you typing a whole nine characters each time you want to reference the field, in return for tainting ALL PHP variables other than those you explicitly set aside to be untainted. (Chris Shiflett suggests setting aside $clean[] to use for untainted fields when this happens, which costs an extra nine characters for every single untainted variable reference in your entire code.) That is millions of extra characters on untainted fields to save a few dozen characters on tainted ones.

So what's the difference between tainted and untainted? An untainted variable is one where you can tell simply from its name that its content has been validated or sanitised. It therefore must be valid or at least harmless. Tainted variables may also be valid but they may not be: they haven't been tested yet. When a variable uses a tainted variable name you know you need to validate or sanitise the value before you can use it safely.

The distinction between tainted and untainted is in the naming convention used for the variable names and not in the content. It means that in a huge program with thousands of different paths through the code you can easily tell that the content of a variable is valid, because the variable name tells you that it has to be. You don't have to trace all the paths to this point to ensure that the value gets validated along the way, because the naming convention distinguishes tainted fields (that may contain anything) from untainted ones (that have been validated or sanitised and so are known to contain a safe, if not a valid, value).
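A minimal sketch of the convention, assuming the $clean array that Chris Shiflett suggests. The field names and the simulated $_POST contents are hypothetical and only there so the sketch runs on its own:

```php
<?php
// Simulated request data so the sketch runs from the command line.
$_POST = ['email' => 'user@example.com', 'nickname' => '<b>unchecked</b>'];

$clean = [];   // by convention, only validated or sanitised values go in here

// The value moves to an untainted name only after it passes validation.
if (filter_var($_POST['email'], FILTER_VALIDATE_EMAIL) !== false) {
    $clean['email'] = $_POST['email'];
}

// Later code reads only from $clean and never has to trace where a value came from.
// $_POST['nickname'] was never validated, so it has no entry in $clean.
```

Anything read from $clean is known to be safe from its name alone; anything still read from $_POST is known to need checking.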

The difference between validating and sanitising is that where the value comes from user input you can return an invalid value to the user to be corrected. When the value is supplied from elsewhere in your code (e.g. another page or a database) you generally expect it to be valid, but you sanitise it so that if it has been tampered with it will at least be made safe.

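Validating uses an if statement that calls a built-in function where one is available to validate your field, e.g. ctype_digit() or a validation filter; if neither exists, use a regular expression with preg_match. (is_int() is not suitable for form fields: it tests the variable's type, and request data always arrives as strings.)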

For example, let's say we have three fields for our visitor to enter, an integer, an email address and a field containing anything except adjacent repeating characters (so we have a simple regular expression example):

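$int = 0;
$email = '';
$nondup = '';
// FILTER_VALIDATE_INT returns false on failure; compare strictly because 0 is a valid integer
if (filter_var($_POST['int'], FILTER_VALIDATE_INT) !== false) {
    $int = (int)$_POST['int'];
} else {
    // error - not an integer
}
if (filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)) {
    $email = $_POST['email'];
} else {
    // error - not an email address
}
// the pattern matches two identical adjacent characters, so a match means the field is invalid
if (!preg_match('/(.)\1/', $_POST['nondup'])) {
    $nondup = $_POST['nondup'];
} else {
    // error - field contains consecutive duplicates
}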
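Because validated input comes from a visitor, failures can be reported back for correction. The following is only a sketch of one way to collect those messages; the function name validateFields(), the message wording and the sample input are all hypothetical:

```php
<?php
// Hypothetical helper: validates the three example fields and returns
// [clean values, error messages keyed by field name].
function validateFields(array $input): array
{
    $clean = [];
    $errors = [];

    $int = filter_var($input['int'] ?? '', FILTER_VALIDATE_INT);
    if ($int !== false) {
        $clean['int'] = $int;
    } else {
        $errors['int'] = 'Please enter a whole number.';
    }

    $email = filter_var($input['email'] ?? '', FILTER_VALIDATE_EMAIL);
    if ($email !== false) {
        $clean['email'] = $email;
    } else {
        $errors['email'] = 'Please enter a valid email address.';
    }

    $nondup = $input['nondup'] ?? '';
    // A match means two identical adjacent characters were found.
    if (is_string($nondup) && !preg_match('/(.)\1/', $nondup)) {
        $clean['nondup'] = $nondup;
    } else {
        $errors['nondup'] = 'No character may appear twice in a row.';
    }

    return [$clean, $errors];
}

// Simulated submission; in a real page the argument would be $_POST.
[$clean, $errors] = validateFields([
    'int'    => '42',
    'email'  => 'not-an-email',
    'nondup' => 'abcb',
]);
```

With this sample input only the email field fails, so $errors holds a single message to redisplay next to that field while the other two values are already available in $clean.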

Sanitising does the equivalent without an if statement and simply makes the content safe to process if it isn't valid. To sanitise you can cast to a type, e.g. (int), use a sanitising filter, or use a regular expression with preg_replace. Because you set a valid value in the first place, sanitising will change nothing unless the data has been tampered with, in which case the value will be made safe.

For example, let's say the same three values are passed in a link from another web page where they are hard coded. We expect the values to be valid unless they have been tampered with, so we sanitise them to be certain that if they have been tampered with they will do no harm:

$int = (int) $_GET['int'];
$email = filter_var($_GET['email'], FILTER_SANITIZE_EMAIL);
// \1+ collapses a run of any length to one character (plain \1 would turn 'aaa' into 'aa')
$nondup = preg_replace('/(.)\1+/', '$1', $_GET['nondup']);
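To see what sanitising does to tampered values, here is a small sketch with made-up query string contents; the simulated $_GET values are assumptions for illustration only. The pattern uses \1+ so that runs of any length collapse to a single character:

```php
<?php
// Simulated tampered query string so the sketch runs outside a web server.
$_GET = [
    'int'    => '42; DROP TABLE users',
    'email'  => 'user(1)@example.com',
    'nondup' => 'baaad',
];

// The cast keeps only the leading numeric part of the string.
$int = (int) $_GET['int'];

// The filter strips characters that are not allowed in an email address.
$email = filter_var($_GET['email'], FILTER_SANITIZE_EMAIL);

// Each run of a repeated character is replaced by one copy of that character.
$nondup = preg_replace('/(.)\1+/', '$1', $_GET['nondup']);
```

After this runs $int is 42, $email is 'user1@example.com' and $nondup is 'bad'. None of the three is guaranteed to be the value the original link carried, but each is safe to process.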

The use of tainted/untainted naming conventions for variables and validating all user inputs and sanitising all other inputs is the most basic of simple security measures to implement.

Note that using an escaping function such as htmlspecialchars or urlencode is a separate matter from validating and sanitising input. Those functions simply allow the program to tell the difference between code and data when they are jumbled together, so that valid data that looks identical to code can be clearly identified as data. The exact escaping function depends on the type of code you are jumbling the data with, and is only necessary when your data is being jumbled with that type of code. Using an escaping function inappropriately just converts your data to gibberish.
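A brief sketch of matching the escaping function to the kind of code the data is being mixed with. The sample value and the page name search.php are hypothetical:

```php
<?php
$name = 'Tom & <Jerry>';   // valid data that happens to look like markup

// HTML context: special characters become entities so the browser treats them as text.
$html = '<p>' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</p>';

// URL context: the same data needs a different escaping function.
$url = 'search.php?q=' . urlencode($name);
```

Here $html becomes '<p>Tom &amp;amp; &amp;lt;Jerry&amp;gt;</p>' when viewed as source, that is the ampersand and angle brackets are written as entities, while $url becomes 'search.php?q=Tom+%26+%3CJerry%3E'. Applying urlencode to the HTML output, or htmlspecialchars to the query string, would produce the gibberish described above.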


This article written by Stephen Chapman, Felgall Pty Ltd.
