
March 4th, 2009

Safely cleaning HTML with strip_tags in C#

Removing unwanted tags with StripTags/strip_tags

One of my favorites in the PHP library is the strip_tags function. Not only does it neatly remove HTML from an input, it also lets you specify which tags should stay. This is great if you allow your visitors to apply some basic HTML tags to their comments. This post explores two issues: using C# to remove unwanted tags, and cleaning up unwanted attributes that might be hidden in the allowed tags.

I wanted to clean unwanted HTML tags from comments posted to a website. Users are allowed to use <b>, <i> and even <a href=""></a> in their posts, but anything else must be stripped before it is posted to the site. I found several regular expressions for C# that strip HTML, but these wipe every tag without exception, so there is no way to keep the allowed ones.
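For reference, the strip-everything approach usually looks something like the sketch below. The class and method names are my own placeholders, and the pattern is just a typical example of this kind of expression, not one taken from a specific source.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripAllDemo
{
    // Removes every tag; there is no way to keep an allowed subset.
    public static string StripAllTags(string input)
    {
        return Regex.Replace(input, @"<[^>]+>", "");
    }

    public static void Main()
    {
        // The <b> and <i> tags are lost along with the <p>.
        Console.WriteLine(StripAllTags("<p>George <b>W</b> <i>Bush</i></p>"));
        // prints: George W Bush
    }
}
```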

Below is the end result of some hacking, and of course much love-hate with the regular expression library.

string StripTags(string Input, string[] AllowedTags)

The StripTags method takes an input string and an array of allowed tags. It returns the input as a string, minus all unwanted tags.

string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
string test2 = StripTags("<p>George <img src='someimage.png' onmouseover='someFunction()'>W <i>Bush</i></p>", new string[]{"p"});
string test3 = StripTags("<a href='http://www.dijksterhuis.org'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});

Using the above example code returns the following:

George<b>W</b><i>Bush</i>
<p>George W Bush</p>
<a href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>
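The same tag-filtering idea can also be sketched more compactly with a single Regex.Replace and a MatchEvaluator. This is an alternative illustration, not the StripTags implementation listed further down; the names TagFilterSketch and KeepOnly are made up for the example, and the pattern assumes that no attribute value contains a ">".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagFilterSketch
{
    // Matches an opening or closing tag and captures the tag name in group 1.
    private static readonly Regex TagExp = new Regex(@"</?([^\s/>]+)[^>]*>");

    // Keeps a tag only when its name is in allowedTags (case-insensitive);
    // every other tag is replaced by an empty string.
    public static string KeepOnly(string input, string[] allowedTags)
    {
        return TagExp.Replace(input, m =>
            allowedTags.Contains(m.Groups[1].Value, StringComparer.OrdinalIgnoreCase)
                ? m.Value
                : "");
    }

    public static void Main()
    {
        Console.WriteLine(KeepOnly("<p>George</p><b>W</b><i>Bush</i>", new[] { "i", "b" }));
        // prints: George<b>W</b><i>Bush</i>
    }
}
```

Because the tag name is captured and compared as a whole, allowing "b" does not accidentally allow a tag such as <blockquote>.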


string StripTagsAndAttributes(string Input, string[] AllowedTags)

The above StripTags function is similar to the original PHP strip_tags function in having the same weakness: it is still possible for a malicious user to insert attributes into each of the allowed tags. Think "style=" and "id=". We would be somewhat safer if we cleaned these as well. The StripTagsAndAttributes method does just that.

It first runs the input through StripTags, and for the remaining tags it strips out all but a restricted set of attributes.

string test4 = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>";
Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));

That "onClick" attribute looks mighty unsafe. Running the string through StripTagsAndAttributes as in the example above returns:

<a class="classof69" href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>

As written, only href and class survive, and only when they appear after an opening <a. The function probably needs some tuning if you want to allow more, or restrict things even further.
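One way to make that tuning easier is to handle each opening tag on its own and keep only the attributes listed for that tag. The following is a rough sketch of that idea under my own assumptions, not the StripTagsAndAttributes code from this post: AttributeFilterSketch, KeepAttributes and the dictionary of allowed attributes are hypothetical names, the dictionary keys are expected in lower case, and attribute values are assumed not to contain a ">".

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class AttributeFilterSketch
{
    // An opening tag: name in group 1, raw attribute text in group 2.
    private static readonly Regex OpenTagExp =
        new Regex(@"<([a-zA-Z][a-zA-Z0-9]*)(\s[^>]*)?>");

    // One attribute: name in group 1, optional quoted or bare value in group 2.
    private static readonly Regex AttrExp =
        new Regex(@"([^\s=""'/]+)(?:\s*=\s*(""[^""]*""|'[^']*'|[^\s""']+))?");

    // Rebuilds every opening tag with only the attributes allowed for that tag.
    // Tags without an entry in 'allowed' lose all their attributes.
    public static string KeepAttributes(string input, IDictionary<string, string[]> allowed)
    {
        return OpenTagExp.Replace(input, tag =>
        {
            string name = tag.Groups[1].Value;
            string[] keep;
            if (!allowed.TryGetValue(name.ToLowerInvariant(), out keep)) keep = new string[0];

            var rebuilt = new StringBuilder("<" + name);
            foreach (Match attr in AttrExp.Matches(tag.Groups[2].Value))
            {
                string attrName = attr.Groups[1].Value;
                if (Array.Exists(keep, k => string.Equals(k, attrName, StringComparison.OrdinalIgnoreCase)))
                    rebuilt.Append(' ').Append(attr.Value);
            }
            // Note: a trailing "/" of a self-closing tag is dropped by this sketch.
            return rebuilt.Append('>').ToString();
        });
    }

    public static void Main()
    {
        var allowed = new Dictionary<string, string[]> { { "a", new[] { "href", "class" } } };
        string test = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>";
        Console.WriteLine(KeepAttributes(test, allowed));
        // prints: <a class="classof69" href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>
    }
}
```

Since every attribute is matched individually, several unwanted attributes in one tag, or values containing a "-", are handled in a single pass. Kept values are copied verbatim, so this sketch does not check what an href actually points to.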

A word of caution

Regular expressions are voodoo: very cool, but still voodoo. The above functions work for the tests I have applied to them, but your mileage may vary! If you have a special situation that doesn’t work, leave a note below and maybe we can work out the problems.
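A cheap way to find such special situations early is a small table of inputs and expected outputs that you run against whatever sanitizer you end up with. The sketch below is only an illustration: SanitizerChecks and Run are made-up names, and the sanitizer in Main is a stand-in that strips every tag, to be swapped for the function you actually want to check.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SanitizerChecks
{
    // Runs each (input, expected) pair through the sanitizer and collects the failures.
    public static List<string> Run(Func<string, string> sanitize, string[,] cases)
    {
        var failures = new List<string>();
        for (int i = 0; i < cases.GetLength(0); i++)
        {
            string actual = sanitize(cases[i, 0]);
            if (actual != cases[i, 1])
                failures.Add(cases[i, 0] + " => " + actual + " (expected " + cases[i, 1] + ")");
        }
        return failures;
    }

    public static void Main()
    {
        // Stand-in sanitizer (an assumption for this example): strips every tag.
        Func<string, string> stripAll = s => Regex.Replace(s, "<[^>]+>", "");

        string[,] cases =
        {
            { "<p>George</p>", "George" },
            { "<h1 style=\"clear:none\">The Miners Strike of 1984</h1>", "The Miners Strike of 1984" },
            // Plain text with angle brackets: the pattern treats "< 2 and 3 >" as a tag.
            { "1 < 2 and 3 > 2", "1 < 2 and 3 > 2" },
        };

        foreach (string failure in Run(stripAll, cases))
            Console.WriteLine("FAILED: " + failure);
    }
}
```

With the stand-in sanitizer the first two cases pass and the third one fails, which is exactly the kind of surprise this sort of table is meant to surface.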

Credits

The StripTags function is of course inspired by the PHP version, and by a JavaScript implementation thereof by Kevin van Sonderveld. The attribute stripping routine is based on the regular expressions by mdw252 in one of the strip_tags manual page comments.

Source code

The complete source code for the StripTags function and StripTagsAndAttributes function with my test code can be found below:


using System;
using System.Text.RegularExpressions;

namespace StripHTML
{
	class MainClass
	{
		
		private static string ReplaceFirst(string haystack, string needle, string replacement)
		{
			int pos = haystack.IndexOf(needle);
			if (pos < 0) return haystack;
			return haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
		}

		private static string ReplaceAll(string haystack, string needle, string replacement)
		{
			// Avoid a possible infinite loop
			if (needle.Length == 0 || needle == replacement) return haystack;

			// ">= 0" so that a match at the very start of the string is replaced too;
			// searching resumes after the inserted text, so a replacement that
			// contains the needle cannot loop forever
			int pos = haystack.IndexOf(needle);
			while (pos >= 0)
			{
				haystack = haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
				pos = haystack.IndexOf(needle, pos+replacement.Length);
			}
			return haystack;
		}

		public static string StripTags(string Input, string[] AllowedTags)
		{
			Regex StripHTMLExp = new Regex(@"(<\/?[^>]+>)");
		    string Output = Input;

			foreach(Match Tag in StripHTMLExp.Matches(Input))
			{
				string HTMLTag = Tag.Value.ToLower();
				bool IsAllowed = false;
				
				foreach(string AllowedTag in AllowedTags)
				{
					int offset = -1;

					// Determine if it is an allowed tag:
					// "<tag>", "<tag " or "</tag>"
					// (the closing tag is matched in full, so that allowing "b"
					// does not also allow "</blockquote>")
					if (offset!=0) offset = HTMLTag.IndexOf('<'+AllowedTag+'>');
					if (offset!=0) offset = HTMLTag.IndexOf('<'+AllowedTag+' ');
					if (offset!=0) offset = HTMLTag.IndexOf("</"+AllowedTag+'>');

					// If it matched any of the above the tag is allowed
					if (offset==0)
					{
					 	IsAllowed = true;
						break;
					}
				}

				// Remove tags that are not allowed
				if (!IsAllowed) Output = ReplaceFirst(Output,Tag.Value,"");
			}

			return Output;
		}

		public static string StripTagsAndAttributes(string Input, string[] AllowedTags)
		{
			/* Remove all unwanted tags first */
			string Output = StripTags(Input,AllowedTags);

			/* Lambda functions */
			MatchEvaluator HrefMatch = m => m.Groups[1].Value + "href..;,;.." + m.Groups[2].Value;
			MatchEvaluator ClassMatch = m => m.Groups[1].Value + "class..;,;.." + m.Groups[2].Value;
			MatchEvaluator UnsafeMatch = m => m.Groups[1].Value + m.Groups[4].Value;
			
			/* Allow the "href" attribute */
			Output = new Regex("(<a.*)href=(.*>)").Replace(Output,HrefMatch);

			/* Allow the "class" attribute */
			Output = new Regex("(<a.*)class=(.*>)").Replace(Output,ClassMatch);

			/* Remove unsafe attributes in any of the remaining tags */
			Output = new Regex(@"(<.*) .*=(\'|\""|\w)[\w|.|(|)]*(\'|\""|\w)(.*>)").Replace(Output,UnsafeMatch);

			/* Return the allowed tags to their proper form */
			Output = ReplaceAll(Output,"..;,;..", "=");
			
			return Output;
		}
			

		public static void Main(string[] args)
		{
			string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
			string test2 = StripTags("<p>George <img src='someimage.png' onmouseover='someFunction()'>W <i>Bush</i></p>", new string[]{"p"});
			string test3 = StripTags("<a href='http://www.dijksterhuis.org'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});
			
			Console.WriteLine(test1);
			Console.WriteLine(test2);
			Console.WriteLine(test3);

			string test4 = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>"; 
			Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));
		}
	}
}

Image credit: Jesper Rønn-Jensen’s


8 Responses to “Safely cleaning HTML with strip_tags in C#”

  1. Mathias Says:

    Hi Martijn,

    Congrats for the nice post.

    I have a problem when I have multiple attributes. Only the last attribute is stripped.

    code:
    string test = "test paragraph";
    Console.WriteLine(StripTagsAndAttributes(test, new string[] { "p" }));

    Rgds

  2. Mathias Says:

    Solved it quick and dirty

    string oldOutput;
    do
    {
    oldOutput = Output;
    /* Remove unsafe attributes in any of the remaining tags */
    Output = new Regex(@"(<.*) .*=(\'|\""|\w)[\w|.|(|)]*(\'|\""|\w)(.*>)").Replace(Output, UnsafeMatch);
    } while (oldOutput != Output);

    rgds,
    Mathias

  3. Mathias Says:

    Hi Martijn,

    I found another problem: when an attribute value contains '-', only the part before the '-' is removed.

    rgds,
    Mathias

  4. Ben Drury Says:

    Great procedure, thanks for developing it. I’m having issues with the MatchEvaluator in the attribute strip.

    ********************
    Description: An error occurred during the compilation of a resource required to service this request. Please review the following specific error details and modify your source code appropriately.

    Compiler Error Message: CS1525: Invalid expression term '>'

    Source Error:

    Line 76: MatchEvaluator HrefMatch = m => m.Groups[1].Value + "href..;,;.." + m.Groups[2].Value;

    ****************************
    Any ideas?

    Cheers,
    Ben

  5. Matt B Says:

    Hi Martijn.
    Thanks for the class, very useful. I have, however, found an issue:

    use this as a string: “any old text goes here”

    my allowed html elements are: p, i, b, h1 (i.e. h1 tags ARE allowed)

    When I pass them through to the StripTagsAndAttributes method, I get the result:

    “any old text goes here”

    I know the stripping of attributes is responsible for this and I should use the StripTags method instead but if h1 is in my allowed list shouldn’t the StripTagsAndAttributes method ignore all attributes associated with this tag?

    Cheers

  6. Matt B Says:

    Apologies it seems some of my HTML was stripped out :-)

    here are the corrections:

    1) use this as a string: h1 style="clear:none" The Miners Strike of 1984 h1 (I have removed the angle brackets but you get the idea)

    2) I get the result:

    h1:none" The Miners Strike of 1984 /h1

    Cheers

  7. mikhail Says:

    > Ben Drury

    This is newer C# syntax (lambda expressions). Try replacing

    MatchEvaluator HrefMatch = m => m.Groups[1].Value + "href..;,;.." + m.Groups[2].Value;
    MatchEvaluator ClassMatch = m => m.Groups[1].Value + "class..;,;.." + m.Groups[2].Value;
    MatchEvaluator UnsafeMatch = m => m.Groups[1].Value + m.Groups[4].Value;

    with this

    MatchEvaluator HrefMatch = delegate(Match m) { return (m.Groups[1].Value + "href..;,;.." + m.Groups[2].Value); };
    MatchEvaluator ClassMatch = delegate(Match m) { return (m.Groups[1].Value + "class..;,;.." + m.Groups[2].Value); };
    MatchEvaluator UnsafeMatch = delegate(Match m) { return (m.Groups[1].Value + m.Groups[4].Value); };

  8. Miguel Says:

    Hello,

    I have been using this code and I really like it.
    I wonder if you ever did the following:

    A method that creates an excerpt of a string containing HTML code without breaking tags.

    Thanks,
    Miguel

