
March 10th, 2009

Regular Expressions in C# – Practical Usage


This is the second post in the C# regular expression series and it follows up on “Regular Expressions in C# – The Basics” which explained the theory behind Regular expressions in C#. In this post we look at how to make practical use of regular expressions in our C# code.

This post touches on four major regular expression subjects:

  • String Comparison – does a string contain a particular sub-string?
  • Splitting a string into segments – we will take an IPv4 address and retrieve its dotted components
  • Replacement – modifying an input string
  • Stricter input validation – how to harden your expressions

String Comparison – finding valid HTML tags

One of the essential functions of regular expressions is finding out whether one string is contained inside another. The Regex.Matches method returns every place in a given string where the pattern matches.

We start with a simple example: finding out where the letter “a” is mentioned in a sentence:

            string Input = "apples make for great party accessories";
            Regex FindA = new Regex("a");

            foreach(Match Tag in FindA.Matches(Input))
            {
                Console.WriteLine("Found 'a' at {0}",Tag.Index);
            }
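
If all you need is a yes/no answer to "does the string contain this?", Regex.IsMatch is a shorter route than looping over the matches. The following is a minimal sketch; the class name ContainsExample and the helper Contains are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContainsExample
{
    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    public static bool Contains(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        string input = "apples make for great party accessories";

        Console.WriteLine(Contains(input, "party"));   // True
        Console.WriteLine(Contains(input, "banana"));  // False

        // Matches(...).Count tells us how often the pattern occurs.
        Console.WriteLine(Regex.Matches(input, "a").Count); // 5
    }
}
```

Use Regex.Matches when you care about where each hit is, and Regex.IsMatch when you only care whether there is one.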

That was almost too easy. Regular expressions really shine if you don’t know exactly what you are looking for but you can describe it. In the following example we will look for all valid HTML tags in an input string.

What is a valid HTML tag? <code>, </code>, <b>, <img src="">, </br> are all valid HTML tags.

Regex HTMLTag = new Regex(@"(<\/?[^>]+>)");

To break this down:

  1. All valid HTML tags start with a “<”
  2. They might or might not have a forward slash: \/? (.NET does not require the backslash because “/” has no special meaning in its regex syntax, but escaping it is harmless)
  3. There are one or more characters which are not “>”: [^>]+
  4. The tag ends with a “>”

The following code example searches for all valid HTML tags in the input string:

using System;
using System.Text.RegularExpressions;

namespace RegularExpression
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            Regex HTMLTag = new Regex(@"(<\/?[^>]+>)");

            string Input = "<b><i><a href='http://apple.com'>Ipod News</a></b></i>";
            
            foreach(Match Tag in HTMLTag.Matches(Input))
            {
                Console.WriteLine("Found {0}",Tag.Value);
            }
        }
    }
}

Resulting in:

Found <b>
Found <i>
Found <a href='http://apple.com'>
Found </a>
Found </b>
Found </i>

Splitting a string into parts

Parentheses () not only group your expressions into parts, they also let you split a single string into multiple segments which you can inspect individually. To demonstrate, we will use a regular expression to split an IPv4 address into its components.

A decimal IPv4 address looks like XXX.XXX.XXX.XXX with each X being a decimal digit. Each column has at least 1 digit and at most 3, so a single column can be described as “(\d{1,3})”. There are four columns, each separated by a dot. The dot (.) has a special meaning in regex, so we need to escape it (\.)

The Regex.Match method returns a new Match instance. We can test Match.Success to see if the input string matched the IP address pattern. Through the Match.Groups property we can then extract each of the four IP address columns. Entry [0] of the Groups property is always the complete match, in this case “10.0.0.6”. Entry [1] contains the first group’s contents, [2] the second, and so on.

            string IPMatchExp = @"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})";
            Match theMatch  = Regex.Match("10.0.0.6",IPMatchExp);
            if (theMatch.Success)
            {
                Console.WriteLine("{0}.{1}.{2}.{3}",theMatch.Groups[1].Value,
                                                      theMatch.Groups[2].Value,
                                                      theMatch.Groups[3].Value,
                                                      theMatch.Groups[4].Value);
            }
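
Because the groups are numbered, you can also walk over them in a loop instead of naming each one. This is only a sketch: the class GroupsExample and the helper SplitIp are hypothetical names, and returning an empty array on failure is just one possible convention.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupsExample
{
    // Hypothetical helper: returns the four columns, or an empty array if there is no match.
    public static string[] SplitIp(string input)
    {
        Match m = Regex.Match(input, @"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})");
        if (!m.Success)
            return new string[0];

        string[] columns = new string[4];
        for (int i = 1; i <= 4; i++)            // Groups[0] is the whole match, so start at 1
            columns[i - 1] = m.Groups[i].Value;
        return columns;
    }

    public static void Main()
    {
        string[] columns = SplitIp("10.0.0.6");
        Console.WriteLine(string.Join(" | ", columns)); // 10 | 0 | 0 | 6
    }
}
```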

String Replacement

It is often useful to manipulate a string by replacing the matched pattern with something new. The Regex.Replace method allows us to specify a pattern to look for and a replacement string.

The following example matches the last character of each word together with the space that follows it, and replaces both with “b_”.

            Regex Replacer = new Regex(@"\w "); // A single word character (letter, digit or underscore) followed by a space
            string Input  = "ax bx sax dam pom";
            string Output = Replacer.Replace(Input,"b_"); // Replace all items found with a b and underscore
            Console.WriteLine(Output);
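
To see what the replacement produces, here is the same idea as a small self-contained sketch using the static Regex.Replace overload. The class ReplaceExample and the helper ReplaceWordEnds are names invented for this illustration. The second call shows that \w is not limited to letters: digits and the underscore count as word characters too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceExample
{
    // Hypothetical helper: replaces every "word character + space" pair with "b_".
    public static string ReplaceWordEnds(string input)
    {
        return Regex.Replace(input, @"\w ", "b_");
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceWordEnds("ax bx sax dam pom")); // ab_bb_sab_dab_pom
        Console.WriteLine(ReplaceWordEnds("a1 b_ c"));           // ab_bb_c
    }
}
```

The last word is left alone in both cases because no space follows it.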

Substitution Patterns

What to do if you would like to flip parts of a string? C# offers several substitution patterns for this. Substitution patterns can only be used in a replacement string, and are used in combination with grouping.

They are useful if you would like to format the results of the match. A common task is to flip two words around. In the example below we flip the name “Molly Malone” into “Malone Molly”:

            Regex Replacer = new Regex(@"(\w*) (\w*)");
            string Input  = "Molly Malone";
            string Output = Replacer.Replace(Input,"$2 $1");
            Console.WriteLine(Output);

The regular expression is defined as two groups of words (\w*) separated by a space. Each group can be referred to with a substitution pattern. $1 refers to the first group, $2 to the second (and if we had defined more groups, $3 would be the third, and so on).
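
The same technique works with three groups. As a sketch, assuming dates arrive as year-month-day and we want day/month/year, the hypothetical helper ToDayMonthYear below rearranges them with $3, $2 and $1. The class name and the date format are assumptions made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionExample
{
    // Hypothetical helper: turns "yyyy-MM-dd" into "dd/MM/yyyy" by reordering the groups.
    public static string ToDayMonthYear(string isoDate)
    {
        // Group 1 = year, group 2 = month, group 3 = day
        return Regex.Replace(isoDate, @"(\d{4})-(\d{2})-(\d{2})", "$3/$2/$1");
    }

    public static void Main()
    {
        Console.WriteLine(ToDayMonthYear("2009-03-10")); // 10/03/2009
    }
}
```

Text that does not match the pattern is passed through unchanged, so the helper is safe to call on strings that contain no date at all.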

Input validation – we have to be more strict

Often we need to check whether data that was entered by a user or read from a file matches a definition, so that we know it is valid. For this to work we need to ensure that our expressions match only valid input. Many expressions written for convenience are too loose. If we are to use them for input validation we need to harden them.

The pattern we used in an earlier example neatly broke down a valid IP address. But it wasn’t very strict and there are many combinations that would have matched that aren’t valid IP addresses. 999.999.999.999 is not a valid IPv4 address but it would have matched our pattern (@"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})"). So we couldn’t have used it to test for a valid IP address.

So what is a valid match? We need to define this first.

A valid IP address range is from 0.0.0.0 to 255.255.255.255 (with each column being represented by a byte).

At this point there are two things we can do: we can validate the results returned by our expression with a few additional lines of C# code, or we can modify our regular expression to become stricter. As this post is about regular expressions, we will modify our expression to match only valid IP addresses.
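
For comparison, the first option (checking the captured groups in C#) could look roughly like the sketch below. The class and method names are hypothetical, and the ^ and $ anchors are an addition of mine so that nothing may surround the address. Note that this variant accepts leading zeros such as 09, because byte.TryParse does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeValidationExample
{
    // Hypothetical helper: loose pattern first, then a range check on each column in C#.
    public static bool IsValidIp(string input)
    {
        Match m = Regex.Match(input, @"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$");
        if (!m.Success)
            return false;

        for (int i = 1; i <= 4; i++)
        {
            byte column;
            // byte.TryParse fails for anything outside 0..255
            if (!byte.TryParse(m.Groups[i].Value, out column))
                return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsValidIp("10.0.0.6"));        // True
        Console.WriteLine(IsValidIp("999.999.999.999")); // False
    }
}
```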

How do we define valid? 0, 9, 10, 19, 100, 199, 200, 249 and 255 are all valid inputs for each column. 300 isn’t valid, and neither is 299. To keep things simple, we don’t allow a leading zero, so 09 is not a valid input.

  • Single digit, 0 – 9: [0-9]
  • Double digit, 10 – 99: [1-9][0-9]
  • Triple digit 1, 100 – 199: 1[0-9]{2}
  • Triple digit 2, 200 – 249: 2[0-4][0-9]
  • Triple digit 3, 250 – 255: 25[0-5]

The single ([0-9]) and double digit ([1-9][0-9]) combinations can be combined into: [1-9]?[0-9]. (Read as: the leading [1-9] is optional; it occurs 0 or 1 times.)

So a single column can be defined as: (([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.) Note the “.” at the end.
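
Before assembling the full expression it can help to try the column alternatives on their own. The sketch below does that with a hypothetical helper IsValidColumn. The ^ and $ anchors are added here only so that the whole test string has to be a single column, and the trailing dot is left out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColumnExample
{
    // Hypothetical helper: true if the text is one column in the range 0..255 without a leading zero.
    public static bool IsValidColumn(string column)
    {
        return Regex.IsMatch(column, @"^([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$");
    }

    public static void Main()
    {
        string[] samples = { "0", "9", "10", "19", "100", "199", "200", "249", "255",
                             "256", "299", "300", "09" };

        foreach (string sample in samples)
        {
            // The first nine samples are accepted, the last four are rejected.
            Console.WriteLine("{0} -> {1}", sample, IsValidColumn(sample));
        }
    }
}
```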

On the final column we do not need a dot. We can save some space by repeating the first expression three times with {3}, but we need to write out the fourth in full. Thus our expression becomes: (([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])

Not exactly easy to read, but let’s test to see if it works as expected. The following example program tries every value from 0 to 999, using the same value in all four columns:

using System;
using System.Text.RegularExpressions;

namespace RegularExpression
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            string IPTestExp = @"(([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])";

            for (int Lp = 0; Lp <= 999; Lp++)
            {
                string IPAddress = String.Format("{0}.{0}.{0}.{0}",Lp);
                
                if (Regex.Match(IPAddress,IPTestExp).Success)
                    Console.WriteLine("{0} is valid",IPAddress);
                else
                {
                    Console.WriteLine("{0} is invalid",IPAddress);
                    break;
                }
            }
          }
    }
}

For brevity the program ends at the first invalid combination. If we had let it run it would have shown 256-999 as invalid.

0.0.0.0 is valid
1.1.1.1 is valid
2.2.2.2 is valid

254.254.254.254 is valid
255.255.255.255 is valid
256.256.256.256 is invalid

This took a bit of work but we now have a single line test to see if a string is a valid IPv4 address.
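
One caveat is worth a sketch. Regex.Match looks for the pattern anywhere in the input, so an unanchored expression can still succeed on part of a bad address: in “1.1.1.256” it finds “1.1.1.25”. Wrapping the expression in ^ and $ requires the whole string to match. The names Unanchored, Anchored and IsStrictIp below are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorExample
{
    // The expression from this post, without anchors.
    public const string Unanchored =
        @"(([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])";

    // Assumption for this sketch: ^ and $ added so the entire input has to match.
    public const string Anchored = "^" + Unanchored + "$";

    // Hypothetical helper using the anchored expression.
    public static bool IsStrictIp(string input)
    {
        return Regex.IsMatch(input, Anchored);
    }

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("1.1.1.256", Unanchored)); // True, matches "1.1.1.25"
        Console.WriteLine(IsStrictIp("1.1.1.256"));                // False
        Console.WriteLine(IsStrictIp("256.1.1.1"));                // False
        Console.WriteLine(IsStrictIp("192.168.0.1"));              // True
    }
}
```

In .NET, $ still tolerates a single trailing newline; if that matters for your input, \z is the stricter end-of-string anchor.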

Concluding

This ends the second post in this series. In the next post I will look at some advanced regular expression topics.

If you would like to read more on the theory behind regular expressions have a look at the first post in the series: Regular Expressions in C# – The Basics

Image credit: Tambako


6 Responses to “Regular Expressions in C# – Practical Usage”

  1. pedro Says:

    Why (]+>) instead of (]+>)?

  2. Martijn Says:

    Hi Pedro,

    There is no need for a final question mark. Similar to how “a” matches every “a” in the input string, the HTML Tag pattern as a whole matches to every valid HTML tag in the input string.

    You only need a question mark if in the pattern definition you might, or might not have a particular character.

    Cheers,
    Martijn

  3. David Kemp Says:

    In your first example,
    Regex HTMLTag = new Regex(@"(<\/?[^>]+>)");
    the parentheses are redundant, and as you’re not using the captured group, you should either remove them, or replace them with the non-capturing group instruction
    eg
    Regex HTMLTag = new Regex(@"<\/?[^>]+>");
    or
    Regex HTMLTag = new Regex(@"(?:<\/?[^>]+>)");

  4. JM Says:

    nice !

  5. Scott Says:

    Thanks for the article. One thing I never knew is the string swap. Thanks.

  6. Nayana Says:

    HI,

    I am relatively new to regular expressions. I am still not clear on what [^>] matches in the following regex?
    Regex HTMLTag = new Regex(@"(<\/?[^>]+>)");
    string Input = "<b><i><a href='http://apple.com'>Ipod News</a></b></i>";

