Regular Expressions

Some object-oriented programming languages provide a class of objects called regular expressions. A regular expression is a pattern that can be used to search, test, and manipulate strings.

The typical syntax for a regular expression places the pattern between slashes, with optional flag characters after the trailing slash. A trailing i indicates that case should be ignored when processing the pattern, and g indicates that the pattern should be processed globally instead of stopping at the first occurrence. Combining the two as gi processes the pattern globally while ignoring case.
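As a small illustration, here is how the two flags behave in JavaScript, one language that uses this slash syntax. The sample string and variable names are invented for the example.

```javascript
// Sample text made up for this illustration.
const text = "Cat cat CAT cat";

// No flags: only the first exact-case occurrence is replaced.
const noFlags = text.replace(/cat/, "dog");

// i: case is ignored, but processing still stops at the first occurrence.
const ignoreCase = text.replace(/cat/i, "dog");

// g: every exact-case occurrence is replaced.
const globalOnly = text.replace(/cat/g, "dog");

// gi: every occurrence is replaced, whatever its case.
const globalIgnoreCase = text.replace(/cat/gi, "dog");

console.log(noFlags);          // Cat dog CAT cat
console.log(ignoreCase);       // dog cat CAT cat
console.log(globalOnly);       // Cat dog CAT dog
console.log(globalIgnoreCase); // dog dog dog dog
```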

The supported methods depend on the programming language, but a regular expression class will probably support the following methods and possibly many more:
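As one concrete case, JavaScript provides test and exec on the regular expression object itself, and match, search, replace, and split on strings. The sketch below is only a sample of what one language offers; other languages name and group these operations differently, and the sample string is invented.

```javascript
// Sample text made up for this illustration.
const sample = "The year 1999 came before 2003";
const year = /\d{4}/;

// test: does the pattern occur anywhere in the string?
const found = year.test(sample);

// exec: details of the first match (matched text and its position).
const firstMatch = year.exec(sample);

// match with the g flag: an array of every matched substring.
const allYears = sample.match(/\d{4}/g);

// search: index of the first match, or -1 if there is none.
const position = sample.search(year);

// replace: a new string with the matches substituted.
const masked = sample.replace(/\d{4}/g, "YYYY");

// split: break the string apart wherever the pattern matches.
const words = sample.split(/\s+/);

console.log(found, firstMatch[0], firstMatch.index); // true 1999 9
console.log(allYears);  // [ '1999', '2003' ]
console.log(position);  // 9
console.log(masked);    // The year YYYY came before YYYY
console.log(words.length); // 6
```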

The most complex part of using regular expressions is working out how to code the pattern (the part between the slashes). A number of characters and character combinations have special meanings; all other characters must match the characters in the string exactly. Where a single character has a special meaning, that meaning can be overridden by preceding the character with a backslash. The characters and character combinations with special meanings are as follows:
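A few special characters that are widely supported can be tried out in JavaScript as follows. This is an illustrative sample rather than a complete list, and the variable names are invented for the example.

```javascript
// . matches any single character except a newline.
const anyChar = /a.c/.test("abc");          // true

// \. overrides the special meaning, so only a real dot matches.
const escapedDotMiss = /a\.c/.test("abc");  // false
const escapedDotHit = /a\.c/.test("a.c");   // true

// \d matches any digit.
const hasDigit = /\d/.test("room 7");       // true

// ^ and $ anchor the pattern to the start and end of the string.
const anchored = /^abc$/.test("xabcx");     // false

// ? means zero or one, + means one or more, * means zero or more
// of the preceding character.
const optional = /ab?c/.test("ac");         // true
const oneOrMore = /ab+c/.test("ac");        // false
const zeroOrMore = /ab*c/.test("abbbc");    // true

// {m,n} means between m and n repetitions of the preceding character.
const counted = /^\d{2,3}$/.test("1234");   // false: four digits is too many
```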

To apply a modifier that affects the preceding character to a larger block of characters, surround the block with parentheses (), so a(bc)?d would match both abcd and ad. To provide alternatives, any one of which can be matched, use |, so a(b|c)d would match both abd and acd. Without the parentheses the alternation covers everything on either side of the |, so ab|cd would match either ab or cd.
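A quick check in JavaScript of how grouping and alternation behave. The anchors ^ and $ are added here so that the whole string has to match, which makes the results easier to read; the variable names are invented for the example.

```javascript
// (bc)? makes the whole block bc optional.
const optionalBlock = /^a(bc)?d$/;
console.log(optionalBlock.test("abcd")); // true
console.log(optionalBlock.test("ad"));   // true
console.log(optionalBlock.test("abd"));  // false: b without c is not allowed

// (b|c) matches either b or c at that one position.
const eitherLetter = /^a(b|c)d$/;
console.log(eitherLetter.test("abd"));   // true
console.log(eitherLetter.test("acd"));   // true

// ab|cd chooses between the two-character strings ab and cd.
// The outer parentheses only make the anchors apply to both alternatives.
const eitherPair = /^(ab|cd)$/;
console.log(eitherPair.test("ab"));      // true
console.log(eitherPair.test("cd"));      // true
console.log(eitherPair.test("abd"));     // false
```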

Regular expressions are a very powerful string manipulation tool, and programming languages that support them can perform find, replace, and other manipulations of string data with a minimum of coding.

As an example of how you can use regular expressions for string manipulation, let's consider a date which we expect to consist of one or two digits followed by a separator, then one or two more digits, a second copy of the same separator, and finally four more digits (this works both for day/month/year dates and for the US format that reverses the day and month fields). Let's work it out one piece at a time.

To test that we begin with a one or two digit number we use ^\d{1,2}, which tests that the first one or two characters are numeric. Valid separators are / - or . so to test for the first separator character we use -|\/|\. (note that . needs to be preceded by \ to override its special meaning of any character except newline and restore its normal meaning of dot; / also needs to be preceded by \ as otherwise it would be taken to be the pattern terminator). The second number is tested with \d{1,2} in the same way as the first. To check that the second separator matches the first we surround the code for the first separator in parentheses and specify \1 as the match criterion for the second occurrence. Finally we use \d{4}$ to check that there are four digits at the end for the year.

So our final regular expression for testing dates is /^\d{1,2}(-|\/|\.)\d{1,2}\1\d{4}$/ and, because it is anchored with ^ and $, it will match only strings that consist entirely of a date in this form. Note that in this example we have not validated the individual numbers against their appropriate ranges, so it would still accept 31/17/2003 as being a valid date even though there are only twelve months in a year.
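The date pattern can be tried out in JavaScript as sketched below, with every digit class written as \d and the year count in braces. The function name looksLikeDate and the sample strings are invented for this illustration.

```javascript
// One or two digits, a separator (- / or .), one or two digits,
// the same separator again (\1), then exactly four digits.
const datePattern = /^\d{1,2}(-|\/|\.)\d{1,2}\1\d{4}$/;

// Hypothetical helper: true when the whole string has the expected shape.
function looksLikeDate(text) {
  return datePattern.test(text);
}

console.log(looksLikeDate("17/3/2003"));    // true
console.log(looksLikeDate("3-17-2003"));    // true
console.log(looksLikeDate("17.03.2003"));   // true
console.log(looksLikeDate("17/3-2003"));    // false: the separators differ
console.log(looksLikeDate("17/3/03"));      // false: the year needs four digits
console.log(looksLikeDate("on 17/3/2003")); // false: extra text before the date
console.log(looksLikeDate("31/17/2003"));   // true: number ranges are not checked
```

The last call shows the limitation already noted: the pattern checks only the shape of the string, so a separate range check on the numbers would be needed to reject impossible dates.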

 

This article written by Stephen Chapman, Felgall Pty Ltd.
