
March 11th, 2009

Advanced Regular Expressions in C#

Regular Expressions in C# - Advanced Topics
In this third and for now last post on using regular expressions we look at some advanced topics. When your expressions become more complicated they also become harder to understand so documenting them can help. And isn’t standard string replacement a little bit too basic? We also look at how speeding things up can improve your code’s efficiency.

In this post we look at three topics:

  1. Improving your code’s readability by documenting regular expressions
  2. Creating conditional string replacement by using MatchEvaluators
  3. Speeding up regular expressions by compiling them, caching them in memory and pre-compiling them to their own DLL.

If you are new to regular expressions in C#, have a look at the theory of regular expressions in Regular Expressions : The Basics. The second post, Regular Expressions in C#: Practical Usage, introduced the most common uses of regular expressions.

Documenting your Regular Expressions

Regular expressions can make for fine alphabet soup. The following expression validates an e-mail address and it does a good job at it. It is also very intimidating at first. Just imagine rereading your code after a few weeks: what is going on in there?

string validEmail = @"\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})\b";

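With a little squinting you see that I would like to extract two groups: the username part, and the domain name part. The .NET regular expression syntax allows us to name each group to make things a little easier to read. We can use the (?<groupname>...) syntax to name each group.

A little rewrite can make our expression a lot easier to read. Once the RegexOptions.IgnorePatternWhitespace option is set, white space in the pattern is ignored and the “#” character starts a comment that runs to the end of the line, so we can document our expressions in line.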

       static string validEmail = @"\b    # Find a word boundary
                       (?<Username>       # Begin group: Username
                       [a-zA-Z0-9._%+-]+  #  Characters allowed in username, 1 or more
                       )                  # End group: Username
                       @                  # The e-mail '@' character
                       (?<Domainname>     # Begin group: Domain name
                       [a-zA-Z0-9.-]+     #  Domain name(s), we include a dot so that
                                          #  mail.dijksterhuis is also possible
                       \.[a-zA-Z]{2,4}    #  A literal dot, then a top level domain of 2 to 4 letters
                                          #  So .info works, .telephone doesn't.
                       )                  # End group: Domain name
                       \b                 # Ending on a word boundary
                       ";

Because we have added a lot of spaces and new lines to our expression we need to tell Regex to ignore them by specifying the RegexOptions.IgnorePatternWhitespace option. RegexOptions.Multiline is not needed for this: it only changes ^ and $ so that they match at the start and end of every line of the input.

          string testEmail = "martijn@dijksterhuis.org";
          Regex TestValidEmail = new Regex(validEmail, RegexOptions.IgnorePatternWhitespace);
           
           // Test the e-mail address
           Match TestResult = TestValidEmail.Match(testEmail);
            
           if (TestResult.Success)
           {
                Console.WriteLine("E-mail is: {0}@{1}",TestResult.Groups["Username"].Value,
                                                         TestResult.Groups["Domainname"].Value);
           }
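
One thing to keep in mind with RegexOptions.IgnorePatternWhitespace is that unescaped spaces and “#” characters in the pattern no longer match literally. The sketch below shows one way to escape them. The "Order #123" input format and the class and method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // With IgnorePatternWhitespace a literal space is written as "\ "
    // and a literal '#' as "\#". Unescaped, the space is skipped and
    // the '#' starts a comment.
    static readonly Regex OrderNumber = new Regex(@"
        Order\ \#          # the literal text 'Order #'
        (?<Number>\d+)     # one or more digits
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns the order number, or null when the input does not match.
    // (Hypothetical helper, for illustration only.)
    public static string ExtractOrderNumber(string input)
    {
        Match m = OrderNumber.Match(input);
        return m.Success ? m.Groups["Number"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractOrderNumber("Order #123 shipped"));
    }
}
```

White space inside a character class such as [ ] is always taken literally, so "[ ]" is an alternative to "\ " if you find it easier to read.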

Conditional string replacement

The Regex.Replace method allows you to use substitution patterns to change the original content around. In a previous post we looked at how we could swap two words around by using grouped patterns and the $1 and $2 substitutions, which refer to the first and second group.

Regex Replacer = new Regex(@"(\w*) (\w*)");
string Input = "Molly Mallone";
string Output = Replacer.Replace(Input,"$2 $1");
Console.WriteLine(Output);
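
Substitutions also work with the named groups from the first section. A minimal sketch of the same swap, assuming group names First and Last (chosen here for illustration) and using \w+ rather than \w* so that each group must contain at least one character:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedSwapDemo
{
    // Same swap as above, but the groups are referenced by name
    // (${Last}, ${First}) instead of by number ($2, $1).
    public static string SwapNames(string input)
    {
        Regex replacer = new Regex(@"(?<First>\w+) (?<Last>\w+)");
        return replacer.Replace(input, "${Last} ${First}");
    }

    public static void Main()
    {
        Console.WriteLine(SwapNames("Molly Mallone")); // prints "Mallone Molly"
    }
}
```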

That is sufficient if you just want to move the data around a little, but it would be nice if you could make a replacement conditional on some external condition. The Regex.Replace method allows you to specify a MatchEvaluator which does just that. MatchEvaluator is a delegate which takes Match as a parameter and returns the replacement string.

This is handy, for example, if you are cleaning up a mailing list and want to update some, but not all, e-mail addresses. In the following code example we know that mail.dijksterhuis.org is now served by smtp.dijksterhuis.org, so we want to move all those users to the new domain name and leave all other e-mail addresses the same.

using System;
using System.Text.RegularExpressions;

namespace RegularExpression
{
	class MainClass
	{

    	static string validEmail = @"\b   			# Find a word boundary
							  (?<Username>			# Begin group: Username
							  [a-zA-Z0-9._%+-]+     #  Characters allowed in username, 1 or more
							  )                     # End group: Username
							  @					    # The e-mail '@' character
							  (?<Domainname>        # Begin group: Domain name
							  [a-zA-Z0-9.-]+        #  Domain name(s), we include a dot so that
                                                    #  mail.dijksterhuis is also possible
							  \.[a-zA-Z]{2,4}       #  A literal dot, then a top level domain of 2 to 4 letters
													#  So .info works, .telephone doesn't.
							  )                     # End group: Domain name
                              \b
							  ";
		
		public static string UpdateDomainNames(Match match)
		{
			if (match.Groups["Domainname"].Value=="mail.dijksterhuis.org")
			 return match.Groups["Username"].Value + "@" + "smtp.dijksterhuis.org";
			return match.Groups[0].Value; // The original
		}
		
		public static void Main(string[] args)
		{

		   Regex TestValidEmail = new Regex(validEmail, RegexOptions.IgnorePatternWhitespace);
		   
		   string[] MailingList = new string[] { "martijn@dijksterhuis.org",
												 "user@mail.dijksterhuis.org",
												 "willy@wortel.org"};

		   foreach(string email in MailingList)
		   {
				// Conditionally replace e-mail addresses 
				Console.WriteLine( TestValidEmail.Replace(email,UpdateDomainNames) );
		   }
		
		}
	}
}
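
Because MatchEvaluator is an ordinary delegate, a lambda expression can stand in for a named method when the replacement logic is short. A minimal sketch, assuming a made-up task (lower-casing the domain part of each address) and illustrative class and method names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LambdaEvaluatorDemo
{
    // The e-mail pattern from this post, written on one line.
    static readonly Regex Email = new Regex(
        @"\b(?<Username>[a-zA-Z0-9._%+-]+)@(?<Domainname>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})\b");

    // Lower-cases the domain part of every address in the text and
    // leaves the username untouched. The lambda is converted to a
    // MatchEvaluator by the compiler.
    public static string NormalizeDomains(string text)
    {
        return Email.Replace(text, m =>
            m.Groups["Username"].Value + "@" +
            m.Groups["Domainname"].Value.ToLowerInvariant());
    }

    public static void Main()
    {
        Console.WriteLine(NormalizeDomains("Mail Willy@WORTEL.ORG today"));
    }
}
```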

Speeding up regular expressions by compiling them

Regular expressions can be quite slow and in another post I found that a simple string replacement routine was some 40 times faster than the equivalent regular expression. Often you will want to stick with the regular expression as it will save you many lines of coding.

As the Regex class encounters your expressions it compiles them to an internal format. It steps through this internal format each time you query the expression. It is also possible to compile your expression directly to MSIL (the byte code to which C# is compiled). In the best possible scenario the Just-In-Time compiler then translates this MSIL code to machine code, giving another speed boost to your expression.

A note of caution: According to the MSDN team the increase in speed can be up to 30% which is nice but certainly isn’t amazing.

You can do this by setting the RegexOptions.Compiled option when you create a new Regex:

Regex theExpression = new Regex(thePattern,RegexOptions.Compiled);

The penalty for this is the time needed to compile the expression, which can add significantly to your application's start-up time. So although “compiled” might sound faster, it might actually be slower. This option is best applied if you use the expression frequently and it has a very long lifetime.
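
Whether the trade-off pays off depends on your pattern and workload, so it is worth measuring. The following is a rough sketch of such a measurement, not a rigorous benchmark. The iteration count and input string are arbitrary assumptions, and the construction time of each Regex is deliberately included in its timing.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class CompiledTimingDemo
{
    const string Pattern = @"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})";

    // Runs IsMatch 'iterations' times and returns the number of successful matches.
    public static int CountMatches(Regex regex, string input, int iterations)
    {
        int hits = 0;
        for (int i = 0; i < iterations; i++)
        {
            if (regex.IsMatch(input)) hits++;
        }
        return hits;
    }

    public static void Main()
    {
        const int iterations = 200000;   // arbitrary; pick a value that suits your workload
        const string input = "Server at 10.0.0.6 responded";

        Stopwatch sw = Stopwatch.StartNew();
        Regex interpreted = new Regex(Pattern);
        CountMatches(interpreted, input, iterations);
        Console.WriteLine("Interpreted: {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        Regex compiled = new Regex(Pattern, RegexOptions.Compiled);  // compile cost is part of the timing
        CountMatches(compiled, input, iterations);
        Console.WriteLine("Compiled:    {0} ms", sw.ElapsedMilliseconds);
    }
}
```

Try it with a low and a high iteration count: the fewer times the expression is used, the more the one-off compile cost dominates.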

The expression cache

If you use many regular expressions, the Regex cache is also an important factor in how quickly your code executes. Each time you define a regular expression the library needs to parse it. If you frequently use a small set of regular expressions through the static Regex methods (such as Regex.IsMatch or Regex.Replace), they won’t be parsed over and over again; instead they come from a cache. By default .NET caches the 15 most recently used expressions. Any more and it will have to parse them again as it encounters them. Regex objects that you create yourself with new are not cached, so hold on to them if you want to reuse them.

It is possible to expand the size of the cache by setting the Regex.CacheSize property to a higher value. This is probably best done once you have an overview of how many expressions are used by your code.
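
A minimal sketch of raising the cache limit at start-up. The value 50 and the helper name EnsureCacheSize are assumptions made for this example; base the real number on a count of the expressions your code uses.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheSizeDemo
{
    // Raises the cache limit if it is below 'wanted' and returns the
    // resulting limit. (Hypothetical helper, for illustration only.)
    public static int EnsureCacheSize(int wanted)
    {
        if (Regex.CacheSize < wanted)
            Regex.CacheSize = wanted;
        return Regex.CacheSize;
    }

    public static void Main()
    {
        Console.WriteLine("Cache size before: {0}", Regex.CacheSize);
        EnsureCacheSize(50);   // 50 is an arbitrary example value
        Console.WriteLine("Cache size after:  {0}", Regex.CacheSize);

        // Static calls such as Regex.IsMatch(input, pattern) use the cache.
        Console.WriteLine(Regex.IsMatch("10.0.0.6", @"^\d{1,3}(\.\d{1,3}){3}$"));
    }
}
```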

Compiling to an assembly

For compiling a regular expression to MSIL you need to pay a hefty price. But with your project about to ship it might be worthwhile to investigate taking your most frequently used regular expressions and putting them pre-compiled into a new assembly. The Regex.CompileToAssembly method performs this function. You will have to write a separate program to do the actual compilation, but once done you can link in the regular expression like any other assembly to your main application.

You can use the following class to create your own set of regular expressions and save them to a new assembly:

using System;
using System.Collections;
using System.Text.RegularExpressions;

namespace CompileExpression
{
	class MainClass
	{
		// Add the expressions to the hash table 
	 	public static Hashtable TheExpressions = new Hashtable();

		// CompileExpressions
		public static void CompileExpressions(string AssemblyName)
		{
			// Reserve space for each expression
			RegexCompilationInfo[] CI = new RegexCompilationInfo[TheExpressions.Count];

			int Cnt = 0;
        	foreach(DictionaryEntry de in TheExpressions)
        	{
				CI[Cnt++] = new RegexCompilationInfo((string)de.Value,		  // the reg. ex pattern
				                                     RegexOptions.Compiled,   // Options to specify
				                                     (string)de.Key,		  // name of the pattern
				                                     "TheRegularExpressions", // name space name
				                                     true );                  // Public? 
        	}

		   // Create a new assembly name structure
		   System.Reflection.AssemblyName aName = new System.Reflection.AssemblyName( );

		   // Assign the name
  		   aName.Name = AssemblyName;

		   // Compile all the regular expressions into the assembly
  		   Regex.CompileToAssembly(CI, aName);
		}
		
		public static void Main(string[] args)
		{
			// Add two expressions to the collection
			TheExpressions.Add("FindHTML",@"(<\/?[^>]+>)");
			TheExpressions.Add("FindTCPIP", @"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})");

			// Compile them to my new assembly called "RegEx"
			CompileExpressions("RegEx");
		}
	}
}

This will create a file called “RegEx.dll” in the directory your program runs from. The next step is to verify that this works as advertised. Create a new project in Visual Studio and add a reference: in the Solution Explorer, right click the name of the new project, click “Add Reference…” and navigate to where the RegEx.dll file is located.

The following class will load the FindTCPIP expression from the DLL and execute it:

using System;

namespace TCPSolution
{
   class Program
   {
       static void Main(string[] args)
       {
           TheRegularExpressions.FindTCPIP MatchTCP = new TheRegularExpressions.FindTCPIP();

           if (MatchTCP.Match("10.0.0.6").Success)
           {
               Console.WriteLine("This works!");
           }
       }
   }
} 

Regular Expressions and Mono

I tested, prodded and played with the code for these regular expression posts on MonoDevelop and Mono, with the exception of the final “Compile to DLL” example. The code for that example compiles, but on execution it throws a “Not Implemented” exception in Regex.CompileToAssembly.

The end

This ends the mini series of three posts on regular expressions. I hope you have enjoyed them. The previous posts in this series are Regular Expressions : The Basics and Regular Expressions in C#: Practical Usage.




9 Responses to “Advanced Regular Expressions in C#”

  1. georg Says:

    Nice series! Clean and easy to follow – even if I’ve used regular expressions for some time now, there’s always something new to pick up. One question though: At line 36 in the fourth listing you say “// The only drawback to named groups is that we need to look up their
    // index offset in the group table.”

    And then you use the index when fetching the named group values. Isn’t it the same thing to use the name of the group directly or am I missing something here, e.g:

    your line(#28): return match.Groups[of_Username].Value
    could be: return match.Groups["Username"].Value

    or…..?

    regards,
    George

  2. Martijn Says:

    Hi George,

    Many thanks for the feedback! You are not missing things — I was so focused on using index entries that I forgot to check if a name would work as well. I will fix the example as this makes things that much cleaner.

    Cheers,
    Martijn

  3. Gareth Says:

    I like to learn something every day. And I never knew you could use “#” to comment regular expressions in conjunction with RegexOptions.IgnorePatternWhitespace. Thanks!

  4. Richard Says:

    “Because we have added a lot of spaces and new lines to our expression we need to tell Regex about them by specifying the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options.”

    RegexOptions.Multiline doesn’t do what you think it does. Rather than specifying that the pattern is on multiple lines, it changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.

    RegexOptions.IgnorePatternWhitespace is enough to tell the engine to ignore the white-space – including new-lines – within your pattern.

  5. Link Love | Kaeli's Space Says:

    […] Advanced Regular Expressions in C# – neat tricks, including how to comment the expression […]

  6. Martin Sykora Says:

    Wow, this is a sweet little intro to regex, good blog writing my friend!

    Keep it up!! I am looking forward to other topics you will cover ;-)

  7. paul heintz Says:

    Excellent, I’ve used them for years and yes after coming back to my code to modify many a time have I crossed my eyes trying to remember why and what I did.

    This is extremely helpful at structuring code

    Thank you,

  8. Holystream Says:

    Because we have added a lot of spaces and new lines to our expression we need to tell Regex about them by specifying the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options.

    This is incorrect. RegexOptions.Multiline changes the behavior of the matching against the target string. It does not have anything to do with breaking the regular expression into multiple lines. For example, if your option is RegexOptions.Singleline, then ^ matches the start of the entire string, $ matches the end of the entire string, while in RegexOptions.Multiline, ^ matches start of the beginning of the line after the last carriage return, and $ matches before the carriage return.

    All you need is RegexOptions.IgnorePatternWhitespace, if you broke your regular expression into multiple lines.

  9. Holystream Says:

    Just noticed Richard said the same thing :)

