March 9th, 2009
Regular Expressions in C# – The Basics

One of the most common coding tasks is to take an input, munch it around and turn it into something different altogether. Are you looking for FedEx numbers in a text file? Do you want to replace “love” with “hate” in your source files? Is a string a valid e-mail address? Problems like these can be solved by applying regular expressions, or “regex” for short.
Introduction
This post explores the basic theory of regular expressions. If you are already familiar with them and want to know how to use them in your own C# programs, have a look at the next post, “Regular Expressions in C# – Practical Applications”.
Expressions offer a method of describing and testing for particular combinations of characters in a string. A simple regular expression can often save you from having to write many lines of regular code.
- Are you looking for the characters “car” in “cartoon”, “carbonate” or “carton”?
- Do you want to match only when the word “car” stands by itself, as in “car sales for 2009”?
- Or return true only when the car is red or blue? (“blue car” / “red car” / “green car”)
In C#, regular expressions are provided by the Regex class in the System.Text.RegularExpressions namespace.
The expressions themselves are more or less standard between computer languages. You can often take an expression from another language and, with little or no work, apply it to your C# code. If you are not familiar with them yet, you should consider learning to use them.
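As a rough sketch, the three questions above could be answered with Regex.IsMatch as shown here. The class and method names are made up for illustration, and the patterns use two things explained further down: parentheses with a vertical bar for alternatives, and \b, a word-boundary marker that this post does not otherwise cover.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IntroDemo
{
    // True when the characters "car" appear anywhere in the input.
    public static bool ContainsCar(string input)
    {
        return Regex.IsMatch(input, "car");
    }

    // True only when "car" stands by itself; \b marks a word boundary.
    public static bool HasWordCar(string input)
    {
        return Regex.IsMatch(input, @"\bcar\b");
    }

    // True only for a red or a blue car.
    public static bool IsRedOrBlueCar(string input)
    {
        return Regex.IsMatch(input, "(red|blue) car");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsCar("cartoon"));            // True
        Console.WriteLine(HasWordCar("cartoon"));             // False
        Console.WriteLine(HasWordCar("car sales for 2009"));  // True
        Console.WriteLine(IsRedOrBlueCar("green car"));       // False
    }
}
```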
What can you use regular expressions for?
- Data capture: split a string into multiple fields which you can manipulate. 13-Jan-2006 becomes (day,month,year)
- Data input validation: Check if the input followed the required formatting rules. For example test if a valid telephone number was entered.
- String comparison: Does A exist in B?
- String replacement: Replace “foo” with “bar”
- Code size reduction: One line of regular expression code can replace a large amount of dedicated code
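To make two of these uses concrete, here is a minimal sketch of data capture and string replacement. The helper names are hypothetical, and the date pattern is only one plausible way to describe “13-Jan-2006”; it relies on groups and repetition, both introduced later in this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UseCaseDemo
{
    // Data capture: split "13-Jan-2006" into day, month and year.
    public static string[] SplitDate(string input)
    {
        Match m = Regex.Match(input, @"(\d{1,2})-([A-Za-z]{3})-(\d{4})");
        if (!m.Success)
            return new string[0];
        // Groups[0] is the whole match; the captured fields start at 1.
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    // String replacement: swap every "foo" for "bar".
    public static string FooToBar(string input)
    {
        return Regex.Replace(input, "foo", "bar");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", SplitDate("13-Jan-2006"))); // 13,Jan,2006
        Console.WriteLine(FooToBar("foo fighters"));                   // bar fighters
    }
}
```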
When not to use regular expressions?
Don’t use them when speed is of the essence. Expressions have a serious drawback in that they can be slow to execute. If you are concerned about optimizing a part of your code it can be worthwhile to write your own replacement. In a previous post I noticed that a simple string replacement routine was 40 times faster than the regular expression equivalent.
The basics
To understand expressions we need a little bit of theory. This bit explains all the main operators and how to use them.
Literal characters
The most basic expression contains a single character. If we define “c” as the expression and test it against “car company” it will match the “c” in “car”. If we ask the Regex class to search again it will match the “c” in “company”.
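“Searching again” could look like the sketch below, which walks from one match to the next with Match.NextMatch. The FindPositions helper is a name invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LiteralDemo
{
    // Returns the start position of every match of pattern in input.
    public static List<int> FindPositions(string input, string pattern)
    {
        var positions = new List<int>();
        Match m = Regex.Match(input, pattern);
        while (m.Success)
        {
            positions.Add(m.Index);
            m = m.NextMatch();   // search again from where the last match ended
        }
        return positions;
    }

    public static void Main()
    {
        // Prints 0 (the "c" in "car") and then 4 (the "c" in "company").
        foreach (int position in FindPositions("car company", "c"))
            Console.WriteLine(position);
    }
}
```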
Several characters have a special meaning: ?, +, *, |, \, [, ], (, ), {, }, . (dot), ^ and $
If we want to include them we need to escape them first using a backslash:
- 10 * 10 = 100 wrong
- 10 \* 10 = 100 ok
Normally, when the C# compiler parses a string literal it interprets escape sequences such as \n and \r. Regular expressions usually contain many backslashes, so it is easier to write the pattern as a verbatim string literal by prefixing it with “@”. The compiler then takes the backslashes literally instead of treating them as escapes, so you can write @"\d" rather than "\\d".
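The difference between an escaped and an unescaped asterisk, and between a verbatim and a regular string literal, can be sketched like this. ContainsLiteral is a hypothetical helper; it leans on Regex.Escape, which escapes the special characters in a string for you.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Treats every character of literal as plain text by escaping it first.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    public static void Main()
    {
        string input = "10 * 10 = 100";

        // Unescaped, the * means "zero or more spaces", so nothing matches.
        Console.WriteLine(Regex.IsMatch(input, "10 * 10 = 100"));    // False

        // Verbatim literal: a single backslash reaches the regex engine.
        Console.WriteLine(Regex.IsMatch(input, @"10 \* 10 = 100"));  // True

        // Regular literal: the backslash itself has to be doubled.
        Console.WriteLine(Regex.IsMatch(input, "10 \\* 10 = 100"));  // True

        Console.WriteLine(ContainsLiteral(input, "10 * 10"));        // True
    }
}
```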
Character Sets
Character sets allow us to limit the characters that can match. Say, for example, we want to match only the digits 0-9: [0-9], or the letters a-z and A-Z: [a-zA-Z]. A character set matches exactly one character, so “c[a-z]kie” matches “cokie” but not “cookie”.
You can also define your own sets. If you are matching a date, the separator can be defined as a space, slash or dash: [ /-]. Put the dash first or last inside the brackets (or escape it as \-); written as [ -/] it would denote the range of all characters from space to slash.
Many character sets are used so often that they have been given their own shorthands:
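- \w matches any word character: letters, digits and the underscore, roughly [a-zA-Z0-9_]
- \s matches any whitespace (space, tab, line break)
- \d matches any digit, [0-9] (in .NET it also matches digits from other Unicode scripts unless RegexOptions.ECMAScript is set)
For a longer list of the available shorthands have a look at my C# Regular Expression Cheat Sheet.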
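A small sketch of character sets and shorthands in action follows. LooksLikeDate is a hypothetical helper, and its set places the dash last so that it is read as a literal dash; a dash between two other characters inside the brackets would form a range instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterSetDemo
{
    // A date written as two digits, a separator, two digits, a separator,
    // two digits, where the separator is a space, slash or dash.
    public static bool LooksLikeDate(string input)
    {
        return Regex.IsMatch(input, @"\d\d[ /-]\d\d[ /-]\d\d");
    }

    public static void Main()
    {
        // A set stands for exactly one character.
        Console.WriteLine(Regex.IsMatch("cokie", "c[a-z]kie"));   // True
        Console.WriteLine(Regex.IsMatch("cookie", "c[a-z]kie"));  // False

        Console.WriteLine(LooksLikeDate("12/08/99"));  // True
        Console.WriteLine(LooksLikeDate("12.08.99"));  // False

        // \w covers letters, digits and the underscore, but not punctuation.
        Console.WriteLine(Regex.IsMatch("_", @"\w"));  // True
        Console.WriteLine(Regex.IsMatch("-", @"\w"));  // False
    }
}
```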
The Dot is special
The dot “.” matches against any character, except for line breaks. You should use it sparingly as it can introduce unwanted results. Often it is better to be more specific, using \w or \d, or a character set that limits the set of possible characters.
- “g..gle” matches “google”, “gaagle”, “g%$gle” and much more.
- “\d\d.\d\d.\d\d” matches a valid date such as “12-08-99” or “12/08/99”, but also an invalid one: “12508799”
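The date example can be tried out with the sketch below, which compares the dot against a more specific character set. Both helper names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // The dot accepts any character as the separator, including a digit.
    public static bool LooseDate(string input)
    {
        return Regex.IsMatch(input, @"\d\d.\d\d.\d\d");
    }

    // A character set only accepts a dash or a slash as the separator.
    public static bool StricterDate(string input)
    {
        return Regex.IsMatch(input, @"\d\d[-/]\d\d[-/]\d\d");
    }

    public static void Main()
    {
        Console.WriteLine(LooseDate("12-08-99"));     // True
        Console.WriteLine(LooseDate("12508799"));     // True, although it is no date
        Console.WriteLine(StricterDate("12-08-99"));  // True
        Console.WriteLine(StricterDate("12508799"));  // False
    }
}
```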
Creating alternatives using the boolean “or”
A vertical bar (|) separates alternatives. It has the lowest precedence of all operators, so “red|blue car” means “red” or “blue car”. To match either a red or a blue car, wrap the alternatives in parentheses: “(red|blue) car”. Written in C# code:
if (Regex.Match("blue car", "(blue|red) car").Success)
    Console.WriteLine("Matches!");
You can add as many alternatives as you like, so “(red|blue|purple|yellow) car” is also possible.
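Because the vertical bar applies to everything on either side of it, it is worth checking what an alternation really matches. The sketch below compares the pattern with and without parentheses around the colours; FirstMatch is a hypothetical helper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Returns the text of the first match, or "" when there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Without parentheses the alternatives are "red" and "blue car".
        Console.WriteLine(FirstMatch("red car", "red|blue car"));       // red
        Console.WriteLine(Regex.IsMatch("red bus", "red|blue car"));    // True

        // With parentheses only the colour varies.
        Console.WriteLine(FirstMatch("red car", "(red|blue) car"));     // red car
        Console.WriteLine(Regex.IsMatch("red bus", "(red|blue) car"));  // False
    }
}
```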
Grouping with parentheses ()
Parentheses () group part of an expression together. So if you would like to match either “color” or “colour” you could write either of:
- col(o|ou)r
- (color|colour)
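A group also remembers what it matched. As a small sketch (the CapturedVowels helper is hypothetical), the first pattern above can report which spelling it found through Match.Groups:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // Returns what the parenthesised group captured, or "" when nothing matches.
    public static string CapturedVowels(string input)
    {
        Match m = Regex.Match(input, "col(o|ou)r");
        return m.Success ? m.Groups[1].Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(CapturedVowels("color"));              // o
        Console.WriteLine(CapturedVowels("colour"));             // ou
        Console.WriteLine(Regex.IsMatch("colr", "col(o|ou)r"));  // False
    }
}
```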
Repetition
A repetition quantifier specifies how often a preceding element is allowed to repeat.
- ? : A question mark indicates zero or one of the preceding element. For example, “S?DRAM” matches “SDRAM” and “DRAM”.
- * : The asterisk indicates zero or more of the preceding element. For example, ab*c matches “ac”, “abc”, “abbc”, “abbbc”, and so on.
- + : The plus sign indicates one or more of the preceding element. For example, ab+c matches “abc”, “abbc”, “abbbc”, and so on, but not “ac”.
- {n}, {n,}, {n,m} : To match an exact number of times use {n}; for at least n matches use {n,}; for at least n but no more than m matches use {n,m}.
To give some examples:
- \d{1,3} reads as “a decimal digit (0-9), a minimum of 1 and a maximum of 3 times”
- [a-z]+ reads as “one or more characters from a to z”: “abc” matches, and so does “axxxz”
In the following example “aab” matches, and so would “aaab”.
if (Regex.Match("aab", "a{2,3}b").Success)
    Console.WriteLine("Matches!");
else
    Console.WriteLine("No Match!");
Repetition is useful for testing whether an input matches a required pattern. If you need to test for a telephone number formatted as XXX-XXXX, you could write this as \d{3}[-]\d{4}.
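The quantifiers can be checked one by one with a sketch like the following. ContainsPhoneNumber is a hypothetical helper around the telephone pattern; note that without anchors it only tests whether such a number occurs somewhere in the input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepetitionDemo
{
    // True when the input contains a number formatted as XXX-XXXX.
    public static bool ContainsPhoneNumber(string input)
    {
        return Regex.IsMatch(input, @"\d{3}[-]\d{4}");
    }

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("DRAM", "S?DRAM"));  // True: zero or one S
        Console.WriteLine(Regex.IsMatch("ac", "ab*c"));      // True: zero or more b
        Console.WriteLine(Regex.IsMatch("ac", "ab+c"));      // False: needs at least one b
        Console.WriteLine(Regex.IsMatch("ab", "a{2,3}b"));   // False: needs two or three a
        Console.WriteLine(ContainsPhoneNumber("555-1234"));  // True
        Console.WriteLine(ContainsPhoneNumber("5551234"));   // False
    }
}
```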
Lazy and Greedy matching
All the above repetition operators are “greedy”: they match the longest possible string they can find.
- a[a-z]+z against “abcbzcdze” returns “abcbzcdz”
- <a.+> against "<a href='index.php'>Beginning</a>" matches everything, instead of just the opening <a href='index.php'>.
To avoid this we can apply “lazy” matching instead. A lazy quantifier repeats as few times as possible, so the match ends as soon as the rest of the pattern can succeed. You make a quantifier lazy by adding a question mark after it:
- a[a-z]+?z against “abcbzcdze” returns “abcbz”
- <a.+?> against "<a href='index.php'>Beginning</a>" returns <a href='index.php'>.
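The difference is easy to see by printing what each variant returns. In this sketch FirstMatch is a hypothetical helper, and the second pair of patterns uses the set [a-z] so that every letter between the “a” and the “z” is allowed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Returns the text of the first match, or "" when there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string html = "<a href='index.php'>Beginning</a>";

        // Greedy: runs on to the last ">" in the input.
        Console.WriteLine(FirstMatch(html, "<a.+>"));   // <a href='index.php'>Beginning</a>

        // Lazy: stops at the first ">" it can.
        Console.WriteLine(FirstMatch(html, "<a.+?>"));  // <a href='index.php'>

        Console.WriteLine(FirstMatch("abcbzcdze", "a[a-z]+z"));   // abcbzcdz
        Console.WriteLine(FirstMatch("abcbzcdze", "a[a-z]+?z"));  // abcbz
    }
}
```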
Anchoring
All the above examples didn’t care where in the string the match was made. You could also use them repeatedly to find more instances of the match in the input string. Anchoring allows you to match only at the beginning and/or the end of the input.
- ^string reads as: only match if “string” is at the beginning of the input. The “^” indicates the beginning. So “string of wool” matches, but “woolly string” doesn’t.
- string$ reads as: only match if “string” is at the end of the input. Here the “$” indicates the end. In this case “string of wool” can’t match, but “woolly string” can.
- ^string$ reads as: only match if “string” is the whole input. The “s” must be the first character and the “g” the last, so only “string” can match this pattern.
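A minimal sketch of the anchors, with “^” placed in front of the pattern and “$” behind it. The three helper names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // "^" pins the match to the beginning of the input.
    public static bool StartsWithString(string input)
    {
        return Regex.IsMatch(input, "^string");
    }

    // "$" pins the match to the end of the input.
    public static bool EndsWithString(string input)
    {
        return Regex.IsMatch(input, "string$");
    }

    // Both anchors together: the input must be "string" and nothing else.
    public static bool IsExactlyString(string input)
    {
        return Regex.IsMatch(input, "^string$");
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithString("string of wool"));  // True
        Console.WriteLine(StartsWithString("woolly string"));   // False
        Console.WriteLine(EndsWithString("woolly string"));     // True
        Console.WriteLine(EndsWithString("string of wool"));    // False
        Console.WriteLine(IsExactlyString("string"));           // True
        Console.WriteLine(IsExactlyString("string of wool"));   // False
    }
}
```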
This ends the theoretical introduction to Regular Expressions — see also the next post “Regular Expressions in C# – Practical Applications” .
Image credit: Sarae
March 9th, 2009 at 4:52 pm
I really like your explanation, simple and clear.
March 10th, 2009 at 1:05 pm
Just my level. Just the basics.
Never used it in C#, but it is incredibly useful in almost every language (just some syntax differences).
March 10th, 2009 at 3:36 pm
Thank you both for your comments ! Trying to keep it simple with regular expressions turned out to be quite a challenge.
April 6th, 2009 at 5:37 pm
I had always struggled to understand regular expressions. But this article is of great help to understand them. Thanks a lot.
April 12th, 2009 at 8:04 pm
Thank you, interesting course on regular expressions. Not a long ago a task appeared to on by him.
October 3rd, 2009 at 1:12 am
I just picked this out randomly, but one of your examples is inaccurate:
* a[bz]+z against “abcbzcdze” returns “abcbzcdz“
it does not return abcbzcdz at all. The expression actually means something like this:
Matches a and any b or z that comes after it, one or more times, then z at the end.
So, when it matches against “abcbzcdze”, it finds a, then finds b, but then it cannot find z as required. So, it doesn’t return anything!
a better expression would be: a[a-z]+?z or a.+?z if you want to return abcbz, or a[a-z]+z or a.+z if you want to return abcbzcdz