
March 9th, 2009

Regular Expressions in C# – The Basics


One of the most common coding tasks is to take an input, munch it around and turn it into something different altogether. Are you looking for FedEx numbers in a text file? Do you want to replace “love” with “hate” in your source files? Is a string a valid e-mail address? Problems like these can be solved by applying regular expressions, or “regex” for short.

Introduction

This post explores the basic theory of regular expressions. If you are already familiar with them and want to know how to use them in your own C# programs, have a look at the next post, “Regular Expressions in C# – Practical Applications”.

Expressions offer a method of describing and testing for particular combinations of characters in a string. A simple regular expression can often save you from having to write many lines of regular code.

  • Are you looking for the characters “car” in “cartoon”, “carbonate” or “carton”?
  • Do you want to match only when the word “car” stands by itself, as in “car sales for 2009”?
  • Or do you want a match only when the car is red or blue: “blue car” and “red car”, but not “green car”?

In C#, regular expressions are provided by the Regex class in the System.Text.RegularExpressions namespace.
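As a first taste, here is a minimal sketch of how the three questions above could be answered with the Regex class. The class and method names are made up for this illustration, and the \b word boundary is an operator this post does not cover, so treat it as an assumption about what “standing by itself” should mean.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CarQuestions
{
    // True when the characters "car" appear anywhere in the input.
    public static bool ContainsCar(string input) => Regex.IsMatch(input, "car");

    // True when "car" stands alone as a word. \b marks a word boundary
    // (an assumption: this operator is not explained in this post).
    public static bool HasWordCar(string input) => Regex.IsMatch(input, @"\bcar\b");

    // True only when the input mentions a red or a blue car.
    public static bool IsRedOrBlueCar(string input) => Regex.IsMatch(input, "(red|blue) car");

    public static void Main()
    {
        Console.WriteLine(ContainsCar("cartoon"));            // True
        Console.WriteLine(HasWordCar("carton"));              // False
        Console.WriteLine(HasWordCar("car sales for 2009"));  // True
        Console.WriteLine(IsRedOrBlueCar("blue car"));        // True
        Console.WriteLine(IsRedOrBlueCar("green car"));       // False
    }
}
```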

The expressions themselves are more or less standard between computer languages. You can often take an expression from another language and, with little or no work, apply it to your C# code. If you are not familiar with them yet you should consider learning to use them.

Regular Expressions to the rescue


What can you use regular expressions for?

  • Data capture: split a string into multiple fields which you can manipulate. 13-Jan-2006 becomes (day,month,year)
  • Data input validation: Check if the input followed the required formatting rules. For example test if a valid telephone number was entered.
  • String comparison: Does A exist in B?
  • String replacement: Replace “foo” with “bar”
  • Code size reduction: One line of regular expression code can replace a large amount of dedicated code
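To make two of these uses concrete, here is a small sketch of data capture and string replacement. The patterns and the helper names (SplitDate, FooToBar) are my own choices for this illustration; capture groups are only touched on lightly in this post, so consider the date pattern one possible way to do it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexUses
{
    // Data capture: split a date such as "13-Jan-2006" into day, month and year.
    // Each pair of parentheses captures one field.
    public static string[] SplitDate(string input)
    {
        Match m = Regex.Match(input, @"(\d{1,2})-([A-Za-z]{3})-(\d{4})");
        if (!m.Success)
            return new string[0];
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    // String replacement: replace every "foo" with "bar".
    public static string FooToBar(string input) => Regex.Replace(input, "foo", "bar");

    public static void Main()
    {
        Console.WriteLine(string.Join(",", SplitDate("13-Jan-2006"))); // 13,Jan,2006
        Console.WriteLine(FooToBar("foo and more foo"));               // bar and more bar
    }
}
```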

When not to use regular expressions?

Don’t use them when speed is of the essence. Expressions have a serious drawback in that they can be slow to execute. If you are concerned about optimizing a part of your code it can be worthwhile to write your own replacement. In a previous post I noticed that a simple string replacement routine was 40 times faster than the regular expression equivalent.

The basics

To understand expressions we need a little bit of theory. This bit explains all the main operators and how to use them.

Literal characters

The most basic expression contains a single character. If we define “c” as the expression and test it against “car company” it will match against the “c” in “car”. If we ask the Regex class to search again it will match against the “c” in “company”.
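The “search again” step can be sketched with Match.NextMatch(), which continues from where the previous match ended. The helper name MatchPositions is hypothetical; it simply collects the index of every match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LiteralDemo
{
    // Returns the start position of every match of the pattern in the input.
    public static List<int> MatchPositions(string input, string pattern)
    {
        var positions = new List<int>();
        Match m = Regex.Match(input, pattern);   // first match
        while (m.Success)
        {
            positions.Add(m.Index);
            m = m.NextMatch();                   // search again, after the previous match
        }
        return positions;
    }

    public static void Main()
    {
        // "c" is found at the start of "car" and at the start of "company".
        Console.WriteLine(string.Join(",", MatchPositions("car company", "c"))); // 0,4
    }
}
```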

Several characters have a special meaning: ?, +, *, |, \, [, ], (, ), {, }, . (dot), ^ and $

If we want to include them we need to escape them first using a backslash:

  • 10 * 10 = 100 wrong
  • 10 \* 10 = 100 ok

Normally, when compiling string literals, C# interprets escape sequences such as \n and \r. Regular expressions usually contain many backslashes. Prefixing the string with “@” makes it a verbatim string literal: the compiler leaves the backslashes alone and passes them through to the regular expression unchanged.

string exampleLiteral = @"10 \* 10 = 100";
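A quick way to convince yourself of the difference is to run both versions of the expression against the literal text. The sketch below also uses Regex.Escape, a helper on the Regex class that escapes special characters for you; the class and method names are only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public const string Input = "10 * 10 = 100";

    // Unescaped: " *" is read as "zero or more spaces", so the literal text does not match.
    public static bool Unescaped() => Regex.IsMatch(Input, @"10 * 10 = 100");

    // Escaped: "\*" is read as a literal asterisk.
    public static bool Escaped() => Regex.IsMatch(Input, @"10 \* 10 = 100");

    // Regex.Escape builds an escaped pattern from arbitrary text.
    public static bool AutoEscaped() => Regex.IsMatch(Input, Regex.Escape(Input));

    public static void Main()
    {
        Console.WriteLine(Unescaped());    // False
        Console.WriteLine(Escaped());      // True
        Console.WriteLine(AutoEscaped());  // True

        // The verbatim literal and the regular literal describe the same string.
        Console.WriteLine(@"10 \* 10" == "10 \\* 10"); // True
    }
}
```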

Character Sets

Character sets allow us to limit the characters that can match. Say, for example, we want to match only the digits 0-9: [0-9], or the letters a-z and A-Z: [a-zA-Z]. A character set matches exactly one character, so “c[a-z]kie” matches “cokie” but not “cookie”.

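You can also define your own sets. If you are matching a date, the date separator can be defined as a space, slash or dash: [ /-]. The dash goes last because a dash between two characters is read as a range; [ -/] would match every character from space to slash, including “+” and “.”.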

Many character sets are used so often that they have been given their own shorthands:

  • \w matches any word character: letters, digits and the underscore, for example [a-zA-Z0-9_]
  • \s matches any whitespace (space, tab, line break)
  • \d matches any digit [0-9]
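The following sketch tries out a few character sets and shorthands. It uses the ^ and $ anchors (covered at the end of this post) so that each test looks at exactly one character; the helper names are hypothetical. Note that a dash between two characters inside the brackets is read as a range, which is why the separator set keeps the dash in last position.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterSetDemo
{
    // A set matches exactly one character: c, one letter, then "kie".
    public static bool IsCokie(string input) => Regex.IsMatch(input, "c[a-z]kie");

    // Space, slash or dash. The dash is last so it is taken literally.
    public static bool IsDateSeparator(string s) => Regex.IsMatch(s, "^[ /-]$");

    // \w: a single word character (letter, digit or underscore).
    public static bool IsWordChar(string s) => Regex.IsMatch(s, @"^\w$");

    public static void Main()
    {
        Console.WriteLine(IsCokie("cokie"));            // True
        Console.WriteLine(IsCokie("cookie"));           // False
        Console.WriteLine(IsDateSeparator("-"));        // True
        Console.WriteLine(IsDateSeparator("+"));        // False
        Console.WriteLine(Regex.IsMatch("+", "[ -/]")); // True: here the dash creates a range
        Console.WriteLine(IsWordChar("_"));             // True
        Console.WriteLine(IsWordChar("%"));             // False
    }
}
```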

For a longer list of the available shorthands have a look at my C# Regular Expression Cheat Sheet.

The Dot is special

The dot “.” matches against any character, except for line breaks. You should use it sparingly as it can introduce unwanted results. Often it is better to be more specific, using \w or \d, or a character set that limits the set of possible characters.

  • “g..gle” matches “google”, “gaagle”, “g%$gle” and much more.
  • “\d\d.\d\d.\d\d” matches a valid date such as “12-08-99” and “12/08/99”, but also an invalid one: “12508799”
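Here is a small sketch comparing the dot-based date expression with a stricter one that only allows a dash or slash as separator. Both are anchored with ^ and $ (explained later in this post) so the whole input has to fit; the method names and the stricter pattern are my own suggestion, not the only way to tighten it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // The dot accepts any character as separator, including digits.
    public static bool LooseDate(string s) => Regex.IsMatch(s, @"^\d\d.\d\d.\d\d$");

    // A character set limits the separator to a dash or a slash.
    public static bool StrictDate(string s) => Regex.IsMatch(s, @"^\d\d[-/]\d\d[-/]\d\d$");

    public static void Main()
    {
        Console.WriteLine(LooseDate("12-08-99"));   // True
        Console.WriteLine(LooseDate("12508799"));   // True, which is not what we want
        Console.WriteLine(StrictDate("12/08/99"));  // True
        Console.WriteLine(StrictDate("12508799"));  // False
    }
}
```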

Creating alternatives using the boolean “or”

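A vertical bar (|) separates alternatives. The bar has the lowest priority of all operators, so “red|blue car” means “red” or “blue car”. To match either a red or a blue car, wrap the alternatives in parentheses (explained in the next section): “(red|blue) car”. Written in C# code:

if (Regex.Match("blue car", "(blue|red) car").Success)
    Console.WriteLine("Matches!");

You can add as many alternatives as you like, so “(red|blue|purple|yellow) car” is also possible.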
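One thing worth checking in code is how far the bar reaches. The sketch below prints the text that was actually matched, with and without parentheses around the alternatives; FirstMatch is a hypothetical helper that returns the matched text, or an empty string when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Returns the text of the first match, or "" when there is none.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // Without parentheses the alternatives are "blue" and "red car".
        Console.WriteLine(FirstMatch("blue car", "blue|red car"));     // blue
        Console.WriteLine(FirstMatch("blue bird", "blue|red car"));    // blue

        // With parentheses the alternatives are "blue" and "red", each followed by " car".
        Console.WriteLine(FirstMatch("blue car", "(blue|red) car"));   // blue car
        Console.WriteLine(Regex.IsMatch("blue bird", "(blue|red) car")); // False
    }
}
```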

Grouping with parentheses ()

Parentheses () group parts of an expression together. So if you would like to match either “color” or “colour” you could write it as one of:

  • col(o|ou)r
  • (color|colour)
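Both spellings of the expression can be tried out with a sketch like this one. The IsWholeMatch helper is hypothetical; it adds the ^ and $ anchors (explained later in this post) so that only the complete word counts. The last line shows that parentheses also remember what they matched, which is available through Match.Groups.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // True when the whole word fits the pattern.
    public static bool IsWholeMatch(string word, string pattern) =>
        Regex.IsMatch(word, "^" + pattern + "$");

    public static void Main()
    {
        foreach (string pattern in new[] { "col(o|ou)r", "(color|colour)" })
        {
            Console.WriteLine(IsWholeMatch("color", pattern));   // True
            Console.WriteLine(IsWholeMatch("colour", pattern));  // True
            Console.WriteLine(IsWholeMatch("colr", pattern));    // False
        }

        // The group keeps the alternative that was used.
        Console.WriteLine(Regex.Match("colour", "col(o|ou)r").Groups[1].Value); // ou
    }
}
```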

Repetition

A repetition quantifier specifies how often a preceding element is allowed to repeat.

? A question mark indicates zero or one of the preceding element. For example, “S?DRAM” matches “SDRAM” and “DRAM”.
* The asterisk indicates zero or more of the preceding element. For example, ab*c matches “ac”, “abc”, “abbc”, “abbbc”, and so on.
+ The plus sign indicates one or more of the preceding element. For example, ab+c matches “abc”, “abbc”, “abbbc”, and so on, but not “ac”.
{n} {n,} {n,m} To match an exact number of times use {n}; for at least n times use {n,}; for at least n and no more than m times use {n,m}.

To give some examples:

  • \d{1,3} reads as “a decimal digit (0-9), minimum of 1, maximum of 3”
  • [a-z]+ reads as “one or more of a-z”; “abc” matches, and so does “axxxz”
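The quantifiers can be checked one by one with a sketch like the following. The Whole helper is hypothetical: it wraps the pattern in a non-capturing group (?: ) and the ^ and $ anchors, neither of which has been introduced at this point, so that the complete input must fit the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // True when the complete input fits the pattern.
    public static bool Whole(string input, string pattern) =>
        Regex.IsMatch(input, "^(?:" + pattern + ")$");

    public static void Main()
    {
        Console.WriteLine(Whole("DRAM", "S?DRAM"));    // True:  zero or one S
        Console.WriteLine(Whole("ac", "ab*c"));        // True:  zero or more b
        Console.WriteLine(Whole("ac", "ab+c"));        // False: at least one b is required
        Console.WriteLine(Whole("abbbc", "ab+c"));     // True
        Console.WriteLine(Whole("123", @"\d{1,3}"));   // True:  one to three digits
        Console.WriteLine(Whole("1234", @"\d{1,3}"));  // False: four digits is too many
        Console.WriteLine(Whole("axxxz", "[a-z]+"));   // True
    }
}
```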

In the following example “aab” matches, but so does “aaab”.

// a{2,3}b reads as: 2 or 3 times a, followed by a b
if (Regex.Match("aab", "a{2,3}b").Success)
    Console.WriteLine("Matches!");
else
    Console.WriteLine("No Match!");

Repetition is useful for testing if an input matches a required pattern. If you need to test for a telephone number formatted as XXX-XXXX you could write this as \d{3}-\d{4}.
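For input validation you usually want the whole input to be a telephone number, not just a part of it. The sketch below therefore assumes the ^ and $ anchors from the end of this post around the expression; the method name IsPhoneNumber is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneDemo
{
    // Three digits, a dash, four digits, and nothing before or after.
    public static bool IsPhoneNumber(string input) =>
        Regex.IsMatch(input, @"^\d{3}-\d{4}$");

    public static void Main()
    {
        Console.WriteLine(IsPhoneNumber("555-1234"));       // True
        Console.WriteLine(IsPhoneNumber("5555-1234"));      // False
        Console.WriteLine(IsPhoneNumber("555-12345"));      // False
        Console.WriteLine(IsPhoneNumber("call 555-1234"));  // False

        // Without anchors the same expression finds the number inside a longer text.
        Console.WriteLine(Regex.IsMatch("call 555-1234", @"\d{3}-\d{4}")); // True
    }
}
```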

Lazy and Greedy matching

All the above repetition operators are “greedy”: they match the longest possible string they can find.

  • a[a-z]+z against “abcbzcdze” returns “abcbzcdz”
  • <a.+> against "<a href='index.php'>Beginning</a>" matches everything, instead of just the opening <a href='index.php'> tag.

To avoid this we can apply “lazy” matching instead. A lazy quantifier matches as few characters as possible, so the match ends at the first point where the rest of the expression fits. You can make a quantifier lazy by simply adding a question mark after it:

  • a[a-z]+?z against “abcbzcdze” returns “abcbz”
  • <a.+?> against "<a href='index.php'>Beginning</a>" returns <a href='index.php'>.
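The difference is easy to see when you print the matched text. In this sketch FirstMatch is a hypothetical helper that returns the text of the first match, or an empty string when there is none. The last call uses the narrower set [bz], which cannot get past the “c” in the input and so finds nothing at all.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Returns the text of the first match, or "" when there is none.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        string html = "<a href='index.php'>Beginning</a>";

        Console.WriteLine(FirstMatch(html, "<a.+>"));   // <a href='index.php'>Beginning</a>
        Console.WriteLine(FirstMatch(html, "<a.+?>"));  // <a href='index.php'>

        Console.WriteLine(FirstMatch("abcbzcdze", "a[a-z]+z"));   // abcbzcdz
        Console.WriteLine(FirstMatch("abcbzcdze", "a[a-z]+?z"));  // abcbz

        // [bz] only allows b and z, so the match fails at the first "c".
        Console.WriteLine(FirstMatch("abcbzcdze", "a[bz]+z") == ""); // True
    }
}
```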

Anchoring

None of the above examples cared where in the string the match was made. You could also use them repeatedly to find more instances of the match in the input string. Anchoring allows you to match only at the beginning and/or end of the input.

  • ^string reads as: only match if “string” is at the beginning of the input. The “^” indicates the beginning. So “string of wool” matches, but “woolly string” doesn’t.
  • string$ reads as: only match if “string” is at the end of the input. Here the “$” indicates the end. In this case “string of wool” can’t match, but “woolly string” can.
  • ^string$ reads as: only match if “string” is the whole input. The “s” comes as the first character, and the “g” as the last. So only “string” can match this pattern.
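The wool examples translate directly into code. This is a minimal sketch with made-up method names, one per anchor combination; “^” goes in front of the text and “$” behind it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // "^" ties the match to the beginning of the input.
    public static bool StartsWithString(string input) => Regex.IsMatch(input, "^string");

    // "$" ties the match to the end of the input.
    public static bool EndsWithString(string input) => Regex.IsMatch(input, "string$");

    // Both anchors together: the input must be exactly "string".
    public static bool IsExactlyString(string input) => Regex.IsMatch(input, "^string$");

    public static void Main()
    {
        Console.WriteLine(StartsWithString("string of wool"));  // True
        Console.WriteLine(StartsWithString("woolly string"));   // False
        Console.WriteLine(EndsWithString("string of wool"));    // False
        Console.WriteLine(EndsWithString("woolly string"));     // True
        Console.WriteLine(IsExactlyString("string"));           // True
        Console.WriteLine(IsExactlyString("string of wool"));   // False
    }
}
```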

This ends the theoretical introduction to Regular Expressions. See also the next post, “Regular Expressions in C# – Practical Applications”.

Image credit: Sarae


7 Responses to “Regular Expressions in C# – The Basics”

  1. Itay Says:

    I really like your explanation, simple and clear.

  2. Peter Says:

    Just my level. Just the basics.
    Never used it in C#, but it is incredibly useful in almost every language (just some syntax differences).

  3. Martijn Says:

    Thank you both for your comments ! Trying to keep it simple with regular expressions turned out to be quite a challenge.

  4. Nayana Says:

    I had always struggled to understand regular expressions. But this article is of great help to understand them. Thanks a lot.

  5. Alexey Says:

    Thank you, interesting course on regular expressions. Not a long ago a task appeared to on by him.

  6. Holystream Says:

    I just picked this out randomly, but one of your examples is inaccurate:

    * a[bz]+z against “abcbzcdze” returns “abcbzcdz“

    it does not return abcbzcdz at all. The expression actually means something like this:
    Matches a and any b or z that comes after it, one or more times, then z at the end.

    So, when it matches against “abcbzcdze”, it finds a, then finds b, but then it cannot find z as required. So, it doesn’t return anything!

    a better expression should be: a[a-z]+?z or a.+?z if you want to return abcbz, or a[a-z]+z or a.+z if you want to return abcbzcdz

  7. Regular Expression in C# -Validating Simple E-mail address « All About .NET Says:

    […] http://www.dijksterhuis.org/regular-expressions-in-csharp-the-basics/ […]

