C# Regular Expression Recipes—Using Common Patterns
| © 2004 O'Reilly & Assoc., Inc. |
Using Common Patterns
Problem
You need a quick list of regular expression patterns that match standard items. These items could be a Social Security number, a zip code, a word containing only letters, an alphanumeric word, an email address, a URL, a date, or one of many other items used throughout business applications. These patterns help verify that a user has entered the correct data and that it is well formed. They can also serve as an extra security measure against attackers who try to break your code by entering strange or malformed input (e.g., SQL injection or cross-site-scripting attacks). Note that these regular expressions are not a silver bullet that will stop all attacks on your system; rather, they are an added layer of defense.
Solution
• Match only alphanumeric characters (and the underscore, which \w also matches) along with the characters -, +, ., and any whitespace:
^([\w\.+-]|\s)*$
• Match only alphanumeric characters along with the characters -, +, ., and any whitespace, with the stipulation that there is at least one of these characters and no more than 10 of these characters:
^([\w\.+-]|\s){1,10}$
• Match a date in the form ##/##/#### where the day and month can each be a one- or two-digit value, and the year can be two to four digits (note that, as written, the pattern also accepts a three-digit year):
^\d{1,2}\/\d{1,2}\/\d{2,4}$
• Match a time entered with an optional am or pm suffix (note that this regular expression also accepts 24-hour military time, and that it checks only the format, not the range of the hour or minute values):
^\d{1,2}:\d{2}\s?([ap]m)?$
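• Match an IPv4 address in dotted-decimal form, with each of the four numbers in the range 0 to 255:
^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$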
• Verify that an email address is in the form name@address where address is not an IP address:
^[A-Za-z0-9_\-\.]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$
• Verify that an email address is in the form name@address where address is an IP address:
^[A-Za-z0-9_\-\.]+@((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$
• Match only a dollar amount with the optional $ and + or - preceding characters (note that any number of decimal places may be added):
^\$?[+-]?[\d,]*(\.\d*)?$
• Match a dollar amount as in the previous regular expression, except that only up to two decimal places are allowed:
^\$?[+-]?[\d,]*\.?\d{0,2}$
• Match a credit card number to be entered as four sets of four digits separated with a space, -, or no character at all:
^((\d{4}[- ]?){3}\d{4})$
• Match a zip code entered as five digits with an optional four-digit extension:
^\d{5}(-\d{4})?$
• Match a North American phone number with an optional area code and an optional - character to be used in the phone number and no extension:
^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}$
• Match a phone number similar to the previous regular expression, but allow an optional five-digit extension prefixed with either ext or extension:
^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}(\s*ext(ension)?[0-9]{5})?$
• Match a full path beginning with the drive letter and optionally match a filename with an extension of up to three characters (note that .. sequences, which move up the directory hierarchy, are not allowed, nor is a directory name containing a . followed by an extension):
^[a-zA-Z]:[\\/]([_a-zA-Z0-9]+[\\/]?)*([_a-zA-Z0-9]+\.[_a-zA-Z0-9]{0,3})?$
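As a minimal sketch of how a few of these patterns might be applied in C#, the helper below wraps Regex.IsMatch from System.Text.RegularExpressions. The class name CommonPatterns, its constant names, and the IsValid method are hypothetical names chosen for illustration; only the pattern strings come from the list above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the class and member names are illustrative only.
public static class CommonPatterns
{
    // Patterns copied from the Solution list. Verbatim strings (@"...") keep
    // the backslashes intact so they reach the regex engine unchanged.
    public const string ZipCode    = @"^\d{5}(-\d{4})?$";
    public const string Date       = @"^\d{1,2}\/\d{1,2}\/\d{2,4}$";
    public const string CreditCard = @"^((\d{4}[- ]?){3}\d{4})$";
    public const string Phone      = @"^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}$";

    // Returns true when the input is non-null and matches the given pattern.
    public static bool IsValid(string input, string pattern)
    {
        return input != null && Regex.IsMatch(input, pattern);
    }
}
```

For example, CommonPatterns.IsValid("12345-6789", CommonPatterns.ZipCode) returns true, while passing "1234" with the same pattern returns false.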
Discussion
Regular expressions are effective at finding specific information, and they have a wide range of uses. Many applications use them to locate specific information within a larger range of text, as well as to filter out bad input. The filtering action is very useful in tightening the security of an application and preventing an attacker from attempting to use carefully formed input to gain access to a machine on the Internet or a local network. By using a regular expression to allow only good input to be passed to the application, you can reduce the likelihood of many types of attacks, such as SQL injection or cross-site-scripting.
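The filtering idea can be sketched as an allow-list check that runs before input reaches the rest of the application. The class name InputFilter and the method name IsAcceptable are assumptions made for this example; the pattern is the first one from the Solution. This is one layer of defense, not a replacement for measures such as parameterized queries.

```csharp
using System.Text.RegularExpressions;

// Hypothetical validator: the class and method names are assumptions.
public static class InputFilter
{
    // Allow-list from the Solution: word characters, '.', '+', '-', and whitespace.
    private static readonly Regex Allowed = new Regex(@"^([\w\.+-]|\s)*$");

    // Returns true only when every character of the input is on the allow-list.
    public static bool IsAcceptable(string input)
    {
        return input != null && Allowed.IsMatch(input);
    }
}
```

With this check, a value such as "John Smith" is accepted, while "x'; DROP TABLE Users;--" is rejected because the quote and semicolon characters are not in the allowed set.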
The regular expressions presented in this recipe provide only a small cross-section of what can be accomplished with them. By taking these expressions and manipulating parts of them, you can easily modify them to work with your application. Take, for example, the following expression, which accepts between 1 and 10 characters, each of which must be alphanumeric or one of a few permitted symbols:
^([\w\.+-]|\s){1,10}$
By changing the {1,10} part of the regular expression to {0,200}, you make this expression match either a blank entry or an entry of up to and including 200 of the allowed characters.
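The effect of changing the quantifier can be sketched with a small factory method that takes the minimum and maximum lengths as parameters. The names LengthLimitedPattern and Build are hypothetical and used only for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical factory: the class and method names are illustrative only.
public static class LengthLimitedPattern
{
    // Builds the recipe's allow-list pattern with a caller-chosen {min,max} range.
    public static Regex Build(int min, int max)
    {
        return new Regex(@"^([\w\.+-]|\s){" + min + "," + max + "}$");
    }
}
```

Here Build(1, 10) rejects both an empty string and the 11-character string "hello world", whereas Build(0, 200) accepts both.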
Note the use of the ^ character at the beginning of the expression and the $ character at the end of the expression. The ^ anchors the match to the beginning of the text and the $ anchors it to the end, so together they force the regular expression to match the entire string or none of it. By removing these characters, you can search for specific text within a larger block of text. For example, the following regular expression matches only a string containing nothing but a U.S. zip code (there can be no leading or trailing spaces):
^\d{5}(-\d{4})?$
This version matches a string containing only a zip code, optionally surrounded by leading or trailing whitespace (notice the addition of the \s* to the start and end of the expression):
^\s*\d{5}(-\d{4})?\s*$
However, this modified expression matches a zip code found anywhere within a string (including a string containing just a zip code):
\d{5}(-\d{4})?
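The difference between the anchored and unanchored forms can be sketched as follows. The class name ZipCodeFinder and its method names are hypothetical; Regex.IsMatch and Regex.Matches are the standard .NET calls.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the class and method names are illustrative only.
public static class ZipCodeFinder
{
    // Anchored: the whole string must be a zip code.
    private static readonly Regex WholeString = new Regex(@"^\d{5}(-\d{4})?$");

    // Unanchored: a zip code may appear anywhere in the text.
    private static readonly Regex Anywhere = new Regex(@"\d{5}(-\d{4})?");

    public static bool IsZipCode(string text)
    {
        return WholeString.IsMatch(text);
    }

    // Collects every zip-code-shaped substring, in order of appearance.
    public static List<string> FindAll(string text)
    {
        var found = new List<string>();
        foreach (Match m in Anywhere.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Given "Ship to 02134 or 90210-1234.", IsZipCode returns false because of the surrounding text, while FindAll returns the two values 02134 and 90210-1234. Note that the unanchored pattern also matches the first five digits of a longer run of digits, so a word boundary (\b) on each side may be worth adding when that matters.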
Use the regular expressions in this recipe and modify them to suit your needs.
See Also
Two good books that cover regular expressions are Regular Expression Pocket Reference by Tony Stubblebine (O’Reilly) and Mastering Regular Expressions, Second Edition, by Jeffrey Friedl (O’Reilly).