March 11th, 2009
Advanced Regular Expressions in C#

In this third, and for now last, post on using regular expressions we look at some advanced topics. As your expressions become more complicated they also become harder to understand, so documenting them helps. And isn’t standard string replacement a little bit too basic? Finally, we look at ways to make your regular expressions run faster.
In this post we look at three topics:
- Improving your code’s readability by documenting regular expressions
- Creating conditional string replacement by using MatchEvaluators
- Speeding up regular expressions by compiling them, caching them in memory and pre-compiling them to their own DLL.
If you are new to regular expressions in C#, have a look at the theory of regular expressions in Regular Expressions : The Basics. The second post, Regular Expressions in C#: Practical Usage, introduced the most common uses of regular expressions.
Documenting your Regular Expressions
Regular expressions can make for fine alphabet soup. The following expression validates an e-mail address and does a good job of it. It is also very intimidating at first. Just imagine rereading your code after a few weeks and wondering what is going on in there.
string validEmail = @"\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})\b";
With a little squinting you see that I would like to extract two groups: the username part and the domain name part. The .NET regular expression engine allows us to name each group to make things a little easier to read. We can use the (?<groupname>...) syntax to name each group.
A little rewrite can make our expression a lot easier to read. When the RegexOptions.IgnorePatternWhitespace option is set, everything from an unescaped “#” to the end of the line is treated as a comment, so we can document our expressions in line.
static string validEmail = @"\b        # Find a word boundary
    (?<Username>                       # Begin group: Username
        [a-zA-Z0-9._%+-]+              # Characters allowed in username, 1 or more
    )                                  # End group: Username
    @                                  # The e-mail '@' character
    (?<Domainname>                     # Begin group: Domain name
        [a-zA-Z0-9.-]+                 # Domain name(s), we include a dot so that
                                       # mail.dijksterhuis is also possible
        \.[a-zA-Z]{2,4}                # A literal dot, then a top level domain of 2 to 4 letters
                                       # So .info works, .telephone doesn't.
    )                                  # End group: Domain name
    \b                                 # Ending on a word boundary
    ";
Because we have added a lot of spaces, new lines and comments to our expression, we need to tell Regex to skip them by specifying the RegexOptions.IgnorePatternWhitespace option. RegexOptions.Multiline is not needed for this: it only changes ^ and $ so that they match at the start and end of every line of the input, and it has nothing to do with how the pattern itself is laid out.
string testEmail = "martijn@dijksterhuis.org";
Regex TestValidEmail = new Regex(validEmail, RegexOptions.IgnorePatternWhitespace);
// Test the e-mail address
Match TestResult = TestValidEmail.Match(testEmail);
if (TestResult.Success)
{
    // Named groups can be looked up directly by name
    Console.WriteLine("E-mail is: {0}@{1}",
        TestResult.Groups["Username"].Value,
        TestResult.Groups["Domainname"].Value);
}
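The difference between the two options is easy to check for yourself. The following is a minimal sketch, not part of the original listings: the class and method names (MultilineDemo, CountLineStarts, MatchesSpreadOutPattern) are made up for illustration. It counts how often ^\w+ matches with and without RegexOptions.Multiline, and then matches a pattern spread over three lines using RegexOptions.IgnorePatternWhitespace alone.

```csharp
using System;
using System.Text.RegularExpressions;

static class MultilineDemo
{
    // Counts how often "^\w+" matches the input under the given options.
    public static int CountLineStarts(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"^\w+", options).Count;
    }

    // A pattern laid out over several lines with comments.
    // Only IgnorePatternWhitespace is needed to parse it.
    public static bool MatchesSpreadOutPattern(string input)
    {
        string pattern = @"\d+   # one or more digits
                           -     # a literal dash
                           \d+   # one or more digits";
        return Regex.IsMatch(input, pattern, RegexOptions.IgnorePatternWhitespace);
    }

    static void Main()
    {
        string text = "first\nsecond\nthird";
        Console.WriteLine(CountLineStarts(text, RegexOptions.None));      // 1: ^ only matches at the start of the string
        Console.WriteLine(CountLineStarts(text, RegexOptions.Multiline)); // 3: ^ matches at the start of every line
        Console.WriteLine(MatchesSpreadOutPattern("12-34"));              // True
    }
}
```

One thing to keep in mind with IgnorePatternWhitespace: a space you actually want to match has to be escaped or placed inside a character class, otherwise it is skipped along with the layout.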
Conditional string replacement
The Regex.Replace method allows you to use substitution parameters to change the original content around. In a previous post we looked at how we could swap two words around by using grouped patterns and the $1 and $2 substitutions, which stand for the text captured by the first and second group.
Regex Replacer = new Regex(@"(\w*) (\w*)");
string Input = "Molly Mallone";
string Output = Replacer.Replace(Input, "$2 $1");
Console.WriteLine(Output);
That is sufficient if you just want to move the data around a little, but it would be nice if you could make a replacement depend on some condition. The Regex.Replace method allows you to specify a MatchEvaluator which does just that. MatchEvaluator is a delegate which takes a Match as its parameter and returns the replacement string.
This is handy, for example, if you are cleaning up a mailing list and want to update some, but not all, e-mail addresses. In the following code example we know that mail.dijksterhuis.org is now served by smtp.dijksterhuis.org, so we want to move all those users to the new domain name and leave all other e-mail addresses the same.
using System;
using System.Text.RegularExpressions;
namespace RegularExpression
{
    class MainClass
    {
        static string validEmail = @"\b        # Find a word boundary
            (?<Username>                       # Begin group: Username
                [a-zA-Z0-9._%+-]+              # Characters allowed in username, 1 or more
            )                                  # End group: Username
            @                                  # The e-mail '@' character
            (?<Domainname>                     # Begin group: Domain name
                [a-zA-Z0-9.-]+                 # Domain name(s), we include a dot so that
                                               # mail.dijksterhuis is also possible
                \.[a-zA-Z]{2,4}                # A literal dot, then a top level domain of 2 to 4 letters
                                               # So .info works, .telephone doesn't.
            )                                  # End group: Domain name
            \b                                 # Ending on a word boundary
            ";
        // Called once for every match; returns the text that replaces it
        public static string UpdateDomainNames(Match match)
        {
            if (match.Groups["Domainname"].Value == "mail.dijksterhuis.org")
                return match.Groups["Username"].Value + "@" + "smtp.dijksterhuis.org";
            return match.Value; // The original, unchanged
        }
        public static void Main(string[] args)
        {
            Regex TestValidEmail = new Regex(validEmail, RegexOptions.IgnorePatternWhitespace);
            string[] MailingList = new string[] { "martijn@dijksterhuis.org",
                                                  "user@mail.dijksterhuis.org",
                                                  "willy@wortel.org" };
            foreach (string email in MailingList)
            {
                // Conditionally replace e-mail addresses
                Console.WriteLine(TestValidEmail.Replace(email, UpdateDomainNames));
            }
        }
    }
}
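Because MatchEvaluator is an ordinary delegate, the replacement logic can also be written as a lambda expression (C# 3.0 and later), which lets it use values that are only known at run time. The sketch below is one way to do this; the LambdaEvaluatorDemo class and its MoveDomain method are hypothetical names, and the pattern is the same e-mail expression written on a single line.

```csharp
using System;
using System.Text.RegularExpressions;

static class LambdaEvaluatorDemo
{
    // Same e-mail pattern as above, without the in-line comments.
    static readonly Regex Email = new Regex(
        @"\b(?<Username>[a-zA-Z0-9._%+-]+)@(?<Domainname>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})\b");

    // Rewrites every address on oldDomain to newDomain and leaves all others alone.
    public static string MoveDomain(string text, string oldDomain, string newDomain)
    {
        // The lambda is converted to a MatchEvaluator and called once per match.
        return Email.Replace(text, match =>
            match.Groups["Domainname"].Value == oldDomain
                ? match.Groups["Username"].Value + "@" + newDomain
                : match.Value);
    }

    static void Main()
    {
        string list = "user@mail.dijksterhuis.org; willy@wortel.org";
        // Prints: user@smtp.dijksterhuis.org; willy@wortel.org
        Console.WriteLine(MoveDomain(list, "mail.dijksterhuis.org", "smtp.dijksterhuis.org"));
    }
}
```

Compared with a named static method, the lambda can capture oldDomain and newDomain directly, so the domain names no longer need to be hard-coded in the evaluator.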
Speeding up regular expressions by compiling them
Regular expressions can be quite slow and in another post I found that a simple string replacement routine was some 40 times faster than the equivalent regular expression. Often you will want to stick with the regular expression as it will save you many lines of coding.
When the Regex class first encounters your expression it converts it to an internal format, which it then interprets step by step each time you run the expression. It is also possible to compile your expression directly to MSIL (the byte code to which C# is compiled). The Just-In-Time compiler then translates this MSIL to machine code, which in the best case gives your expression another speed boost.
A note of caution: According to the MSDN team the increase in speed can be up to 30% which is nice but certainly isn’t amazing.
You can do this by setting the RegexOptions.Compiled option when you create a new Regex:
Regex theExpression = new Regex(thePattern,RegexOptions.Compiled);
The penalty for this is the time it takes to compile the expression, which can add significantly to your application’s start-up time. So although “compiled” might sound faster, it might actually be slower overall. Compilation pays off when you use the expression frequently and the Regex object has a very long lifetime.
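Whether the trade-off works out depends on your pattern, your input and your runtime, so it is worth measuring. The following is a rough sketch using System.Diagnostics.Stopwatch; the CompiledTimingDemo class, the TimeMatches helper, the sample input and the iteration count are all made up for illustration, and the numbers it prints will differ from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class CompiledTimingDemo
{
    // Returns the elapsed milliseconds for running IsMatch 'iterations' times.
    public static long TimeMatches(Regex regex, string input, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        const string pattern = @"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})";
        const string input = "The server lives at 10.0.0.6 today";

        // Measure the one-off construction cost of each variant.
        Stopwatch build = Stopwatch.StartNew();
        Regex interpreted = new Regex(pattern);
        long buildInterpreted = build.ElapsedMilliseconds;

        build = Stopwatch.StartNew();
        Regex compiled = new Regex(pattern, RegexOptions.Compiled);
        long buildCompiled = build.ElapsedMilliseconds;

        Console.WriteLine("Construction: {0} ms interpreted, {1} ms compiled",
            buildInterpreted, buildCompiled);

        // Measure the cost of using each variant many times.
        Console.WriteLine("Interpreted matching: {0} ms", TimeMatches(interpreted, input, 200000));
        Console.WriteLine("Compiled matching:    {0} ms", TimeMatches(compiled, input, 200000));
    }
}
```

If the expression is only used a handful of times, the construction cost is what counts; the matching figures only start to matter for expressions that are used over and over again.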
The expression cache
If you use many regular expressions, the Regex cache is also an important factor in how quickly your code executes. Each time you hand a pattern to the library it has to be parsed into the internal format. The static Regex methods, such as Regex.IsMatch and Regex.Replace, keep the most recently used patterns in a cache, so a small set of frequently used expressions is not parsed over and over again. By default the cache holds the last 15 expressions; any more and older ones have to be parsed again when they next come up. Note that since .NET 2.0 the cache serves only the static methods: a Regex object you construct yourself is not cached, so keep a reference to it and reuse it.
It is possible to expand the cache by setting the static Regex.CacheSize property to a higher value. This is best done after you have taken stock of how many different expressions your code uses.
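Changing the cache size is a single assignment. A minimal sketch, in which the CacheDemo class, the EnsureCacheSize helper and the value 50 are only illustrative choices:

```csharp
using System;
using System.Text.RegularExpressions;

static class CacheDemo
{
    // Raises Regex.CacheSize if it is below 'wanted' and returns the resulting size.
    public static int EnsureCacheSize(int wanted)
    {
        if (Regex.CacheSize < wanted)
        {
            Regex.CacheSize = wanted;
        }
        return Regex.CacheSize;
    }

    static void Main()
    {
        Console.WriteLine("Cache size before: {0}", Regex.CacheSize);
        Console.WriteLine("Cache size after:  {0}", EnsureCacheSize(50));

        // Static calls like this one are the ones that go through the cache.
        Console.WriteLine(Regex.IsMatch("10.0.0.6", @"^\d{1,3}(\.\d{1,3}){3}$"));
    }
}
```

The helper only ever grows the cache, so calling it from several places in a program cannot accidentally shrink a size that was set elsewhere.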
Compiling to an assembly
Compiling a regular expression to MSIL at run time comes at a hefty price. But with your project about to ship it might be worthwhile to take your most frequently used regular expressions and put them, pre-compiled, into a new assembly. The Regex.CompileToAssembly method performs this function. You will have to write a separate program to do the actual compilation, but once that is done you can reference the resulting assembly from your main application like any other.
You can use the following class to create your own set of regular expressions and save them to a new assembly:
using System;
using System.Collections;
using System.Text.RegularExpressions;
namespace CompileExpression
{
    class MainClass
    {
        // Add the expressions to the hash table
        public static Hashtable TheExpressions = new Hashtable();
        // CompileExpressions
        public static void CompileExpressions(string AssemblyName)
        {
            // Reserve space for each expression
            RegexCompilationInfo[] CI = new RegexCompilationInfo[TheExpressions.Count];
            int Cnt = 0;
            foreach (DictionaryEntry de in TheExpressions)
            {
                CI[Cnt++] = new RegexCompilationInfo((string)de.Value,        // the reg. ex pattern
                                                     RegexOptions.Compiled,   // Options to specify
                                                     (string)de.Key,          // name of the pattern
                                                     "TheRegularExpressions", // name space name
                                                     true);                   // Public?
            }
            // Create a new assembly name structure
            System.Reflection.AssemblyName aName = new System.Reflection.AssemblyName();
            // Assign the name
            aName.Name = AssemblyName;
            // Compile all the regular expressions into the assembly
            Regex.CompileToAssembly(CI, aName);
        }
        public static void Main(string[] args)
        {
            // Add two expressions to the collection
            TheExpressions.Add("FindHTML", @"(<\/?[^>]+>)");
            TheExpressions.Add("FindTCPIP", @"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})");
            // Compile them to my new assembly called "RegEx"
            CompileExpressions("RegEx");
        }
    }
}
This will create a file called “RegEx.dll” in the directory your program runs in. The next step is to verify that this works as advertised. Create a new project in Visual Studio and add a reference: in the Solution Explorer, right-click the name of the new project, click “Add Reference…” and navigate to where the RegEx.dll file is located.
The following class will load the FindTCPIP expression from the DLL and execute it:
using System;
namespace TCPSolution
{
    class Program
    {
        static void Main(string[] args)
        {
            // Each compiled expression is a class in the TheRegularExpressions namespace
            TheRegularExpressions.FindTCPIP MatchTCP = new TheRegularExpressions.FindTCPIP();
            if (MatchTCP.Match("10.0.0.6").Success)
            {
                Console.WriteLine("This works!");
            }
        }
    }
}
Regular Expressions and Mono
I tested, prodded and played with the code for these regular expression posts on MonoDevelop and Mono, and everything worked with the exception of the final “Compile to DLL” example. The code for that example compiles, but on execution it throws a “Not Implemented” exception in Regex.CompileToAssembly.
The end
This ends the mini series of three posts on regular expressions. I hope you have enjoyed them. The previous posts in this series are:
- Regular Expressions : The Basics. The theory behind regular expressions.
- Regular Expressions in C#: Practical Usage. Examples of common usage.
Image through Flickr by Djenan
Tags: regex
March 12th, 2009 at 5:22 am
Nice series! Clean and easy to follow – even if I’ve used regular expressions for some time now, there’s always something new to pick up. One question though: At line 36 in the fourth listing you say “// The only drawback to named groups is that we need to look up their
// index offset in the group table.”
And then you use the index when fetching the named group values. Isn’t it the same thing to use the name of the group directly or am I missing something here, e.g:
your line(#28): return match.Groups[of_Username].Value
could be: return match.Groups["Username"].Value
or…..?
regards,
George
March 12th, 2009 at 10:15 am
Hi George,
Many thanks for the feedback! You are not missing things — I was so focused on using index entries that I forgot to check if a name would work as well. I will fix the example as this makes things that much cleaner.
Cheers,
Martijn
March 12th, 2009 at 10:42 pm
I like to learn something every day. And I never knew you could use “#” to comment regular expressions in conjunction with RegexOptions.IgnorePatternWhitespace. Thanks!
March 14th, 2009 at 1:34 am
“Because we have added a lot of spaces and new lines to our expression we need to tell Regex about them by specifying the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options.”
RegexOptions.Multiline doesn’t do what you think it does. Rather than specifying that the pattern is on multiple lines, it changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
RegexOptions.IgnorePatternWhitespace is enough to tell the engine to ignore the white-space – including new-lines – within your pattern.
June 16th, 2009 at 1:49 am
Wow, this is a sweet little intro to regex, good blog writing my friend!
Keep it up!! I am looking forward to other topics you will cover
August 27th, 2009 at 7:24 pm
Excellent, I’ve used them for years and yes after coming back to my code to modify many a time have I crossed my eyes trying to remember why and what I did.
This is extremely helpful at structuring code
Thank you,
October 3rd, 2009 at 12:50 am
Because we have added a lot of spaces and new lines to our expression we need to tell Regex about them by specifying the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options.
This is incorrect. RegexOptions.Multiline changes the behavior of the matching against the target string. It does not have anything to do with breaking the regular expression into multiple lines. For example, if your option is RegexOptions.Singleline, then ^ matches the start of the entire string, $ matches the end of the entire string, while in RegexOptions.Multiline, ^ matches start of the beginning of the line after the last carriage return, and $ matches before the carriage return.
All you need is RegexOptions.IgnorePatternWhitespace, if you broke your regular expression into multiple lines.
October 3rd, 2009 at 1:14 am
Just noticed Richard said the same thing
March 26th, 2010 at 8:37 pm
I’ve been using Regular Expressions for just about anything. I always knew they were a bit slower but never like 40 times slower!! Next time I’ll consider simple string operations before going the regex way!
Good article btw!
June 5th, 2010 at 3:49 am
Excellent article. The Conditional String Replacement section was exactly what I was looking for. Thanks.
January 21st, 2011 at 2:45 pm
Excellent post. Working on Regular expressions in .net. Found it helpful as a novice to start learning the Regular expressions.