C# Regular Expression Recipes—A Better Tokenizer
© 2004 O'Reilly & Assoc., Inc.
A Better Tokenizer
Problem
A simple method of tokenizing (breaking up a string into its discrete elements) was presented in Recipe 2.6. However, it is not powerful enough to handle all your string-tokenizing needs. You need a tokenizer, also referred to as a lexer, that can split up a string based on a well-defined set of characters.
Solution
Using the Split method of the Regex class, we can use a regular expression to indicate
the types of tokens and separators that we are interested in gathering. This technique
works especially well with equations, since the tokens of an equation are well-defined.
For example, the code:
using System;
using System.Text.RegularExpressions;

public static string[] Tokenize(string equation)
{
    Regex RE = new Regex(@"([\+\-\*\(\)\^\\])");
    return (RE.Split(equation));
}
will divide up a string according to the regular expression specified in the Regex constructor.
In other words, the string passed to the Tokenize method will be divided
up based on the delimiters +, -, *, (, ), ^, and \. The following method calls the
Tokenize method to tokenize the equation (y - 3)(3111*x^21 + x + 320):
public void TestTokenize( )
{
    foreach (string token in Tokenize("(y - 3)(3111*x^21 + x + 320)"))
        Console.WriteLine("String token = " + token.Trim( ));
}
which displays the following output:
String token =
String token = (
String token = y
String token = -
String token = 3
String token = )
String token =
String token = (
String token = 3111
String token = *
String token = x
String token = ^
String token = 21
String token = +
String token = x
String token = +
String token = 320
String token = )
String token =
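Notice that each individual operator, parenthesis, variable, and number has been broken out into its own separate token. The empty tokens appear where a delimiter sits at the very start or end of the string, or where two delimiters are adjacent, as with the )( in the middle of the equation.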
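The output above includes a few empty tokens. If they are not wanted, one possible approach is to trim each token and filter out the empty ones after splitting. The following is a minimal sketch under that assumption; the class name TokenizerSketch and the helper TokenizeNonEmpty are hypothetical and not part of the original recipe.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Same pattern as the recipe's Tokenize method.
    private static readonly Regex Delimiters = new Regex(@"([\+\-\*\(\)\^\\])");

    // Hypothetical helper: trims each token and drops the empty ones.
    public static string[] TokenizeNonEmpty(string equation)
    {
        return Delimiters.Split(equation)
                         .Select(token => token.Trim())
                         .Where(token => token.Length > 0)
                         .ToArray();
    }

    public static void Main()
    {
        string[] tokens = TokenizeNonEmpty("(y - 3)(3111*x^21 + x + 320)");
        // Prints: ( y - 3 ) ( 3111 * x ^ 21 + x + 320 )
        Console.WriteLine(string.Join(" ", tokens));
    }
}
```

Filtering after the split keeps the regular expression itself unchanged, so the delimiter set can still be adjusted in one place.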
Discussion
The tokenizer created in Recipe 2.6 would be useful in specific controlled circumstances.
However, in real-world projects, we do not always have the luxury of being
able to control the set of inputs to our code. By making use of regular expressions,
we can take the original tokenizer and make it flexible enough to allow it to be
applied to any type or style of input we desire.
The key method used here is the Split instance method of the Regex class. The
return value of this method is a string array whose elements include each individual
token of the source string—the equation, in this case.
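The delimiters show up in the returned array because the pattern is wrapped in capturing parentheses; Regex.Split adds any captured text to its result. The short sketch below contrasts a capturing and a non-capturing version of a reduced pattern. The class and method names are illustrative only, and the pattern is a shortened stand-in for the one in the recipe.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureGroupSketch
{
    // With capturing parentheses, matched delimiters are returned as elements.
    public static string[] SplitKeepingDelimiters(string input)
    {
        return new Regex(@"([\+\^])").Split(input);
    }

    // Without them, only the text between delimiters is returned.
    public static string[] SplitDroppingDelimiters(string input)
    {
        return new Regex(@"[\+\^]").Split(input);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", SplitKeepingDelimiters("x^21+320")));  // x|^|21|+|320
        Console.WriteLine(string.Join("|", SplitDroppingDelimiters("x^21+320"))); // x|21|320
    }
}
```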
The Regex class also provides static Split overloads. The static overloads
accept RegexOptions enumeration values, while the instance overloads accept a
count that limits how many substrings the input is split into and a starting
position for the search. This may have some bearing on whether you
choose the static or instance method.
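As a rough illustration of the difference, the sketch below calls one static overload with a RegexOptions value and one instance overload with a count. The inputs, patterns, and the names SplitOverloadsSketch, SplitIgnoringCase, and SplitWithCount are made up for this example. A further instance overload, Split(input, count, startat), also takes the position at which the search begins; it is not shown here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOverloadsSketch
{
    // Static form: the pattern and RegexOptions are supplied on each call.
    public static string[] SplitIgnoringCase(string input)
    {
        return Regex.Split(input, "x", RegexOptions.IgnoreCase);
    }

    // Instance form: count limits how many substrings are produced;
    // the last element holds the unsplit remainder.
    public static string[] SplitWithCount(string input, int count)
    {
        Regex plus = new Regex(@"\+");
        return plus.Split(input, count);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", SplitIgnoringCase("1x2X3")));   // 1|2|3
        Console.WriteLine(string.Join("|", SplitWithCount("a+b+c+d", 2))); // a|b+c+d
    }
}
```

Options such as RegexOptions.IgnoreCase can also be passed to the Regex constructor, so the instance form does not rule them out; the choice mostly comes down to whether the count or starting position is needed.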
See Also
See the ".NET Framework Regular Expressions" topic in the MSDN documentation.

