<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Martijn's C# Programming Blog &#187; strip_tags</title>
	<atom:link href="http://www.dijksterhuis.org/tag/strip_tags/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dijksterhuis.org</link>
	<description>Information, news about programming in C#</description>
	<lastBuildDate>Fri, 07 Aug 2009 21:26:47 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Safely cleaning HTML with strip_tags in C#</title>
		<link>http://www.dijksterhuis.org/safely-cleaning-html-with-strip_tags-in-csharp/</link>
		<comments>http://www.dijksterhuis.org/safely-cleaning-html-with-strip_tags-in-csharp/#comments</comments>
		<pubDate>Wed, 04 Mar 2009 05:37:50 +0000</pubDate>
		<dc:creator>Martijn</dc:creator>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Learn C#]]></category>
		<category><![CDATA[c#]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[striptags]]></category>
		<category><![CDATA[strip_tags]]></category>

		<guid isPermaLink="false">http://www.dijksterhuis.org/?p=758</guid>
		<description><![CDATA[
One of my favorites in the PHP libraries is the strip_tags function. Not only does it neatly remove HTML from an input it also allows you to specify which tags should stay. This is great if you are allowing your visitors to apply some basic HTML tags to their comments. This post explores two issues: [...]<p>This is a post from <a href="http://www.dijksterhuis.org">Martijn's C# Coding Blog</a>. </p>
]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.dijksterhuis.org/wp-content/uploads/2009/03/sub.jpg" alt="Removing unwanted tags with StripTags/strip_tags" title="Removing unwanted tags with StripTags/strip_tags" width="500" height="205" class="aligncenter size-full wp-image-763" /></p>
<p><em>One of my favorites in the PHP libraries is the strip_tags function. Not only does it neatly remove HTML from an input it also allows you to specify which tags should stay. This is great if you are allowing your visitors to apply some basic HTML tags to their comments. This post explores two issues: using C# to remove unwanted tags, and cleaning up unwanted attributes that might be hidden in the allowed tags.</em></p>
<p><span id="more-758"></span></p>
<p>I wanted to strip unwanted HTML tags from comments posted to a website. Users are allowed to use &lt;b&gt;, &lt;i&gt; and even &lt;a href=&quot;&quot;&gt;&lt;/a&gt; in their posts, but anything else must be stripped before it is published on the site. I found several regular expressions for C# that strip HTML, but they wipe out every tag indiscriminately and leave none of the allowed ones.</p>
<p>Below is the end result of some hacking, and of course a love-hate struggle with the regular expression library.</p>
<p><strong><span>string StripTags(string Input, string[] AllowedTags)</span></strong></p>
<p>The StripTags method takes an input string and an array of allowed tags. It returns the input with every tag that is not in the array removed.</p>
<pre class="brush: c#">
string test1 = StripTags(&quot;&lt;p&gt;George&lt;/p&gt;&lt;b&gt;W&lt;/b&gt;&lt;i&gt;Bush&lt;/i&gt;&quot;, new string[]{&quot;i&quot;,&quot;b&quot;});
string test2 = StripTags(&quot;&lt;p&gt;George &lt;img src=&#039;someimage.png&#039; onmouseover=&#039;someFunction()&#039;&gt;W &lt;i&gt;Bush&lt;/i&gt;&lt;/p&gt;&quot;, new string[]{&quot;p&quot;});
string test3 = StripTags(&quot;&lt;a href=&#039;http://www.dijksterhuis.org&#039;&gt;Martijn &lt;b&gt;Dijksterhuis&lt;/b&gt;&lt;/a&gt;&quot;, new string[]{&quot;a&quot;});
</pre>
<p>Running the example code above returns the following:</p>
<div style="margin-left: 40px;"><span>George&lt;b&gt;W&lt;/b&gt;&lt;i&gt;Bush&lt;/i&gt;</span><br />
<span>&lt;p&gt;George W Bush&lt;/p&gt;</span><br />
<span>&lt;a href=&#039;http://www.dijksterhuis.org&#039;&gt;Martijn Dijksterhuis&lt;/a&gt;</span></div>
<p><strong><span>string StripTagsAndAttributes(string Input, string[] AllowedTags)</span></strong></p>
<p>The StripTags function above shares a weakness with the original PHP strip_tags function: a malicious user can still insert attributes into any of the allowed tags. Think &#8220;style=&#8221;, &#8220;id=&#8221; or an event handler such as &#8220;onclick=&#8221;. We would be somewhat safer if we cleaned these as well. The <em><span>StripTagsAndAttributes</span></em> <span>method</span> does just that.</p>
<p>It first runs the input through <em>StripTags</em>, and for the remaining tags it strips out all but a restricted set of attributes.</p>
<pre class="brush: c#">
string test4 = &quot;&lt;a class=\&quot;classof69\&quot; onClick=&#039;crosssite.boom()&#039; href=&#039;http://www.dijksterhuis.org&#039;&gt;Martijn Dijksterhuis&lt;/a&gt;&quot;;
Console.WriteLine(StripTagsAndAttributes(test4, new string[]{&quot;a&quot;}));
</pre>
<p>That &#8220;onClick&#8221; attribute looks mighty unsafe. Running the string through <em><span>StripTagsAndAttributes</span></em> as in the example above returns:</p>
<div style="margin-left: 40px;"><span>&lt;a class=&quot;classof69&quot; href=&#039;http://www.dijksterhuis.org&#039;&gt;Martijn Dijksterhuis&lt;/a&gt;</span></div>
<p>This function probably needs some tuning if you want to allow more attributes or restrict things even further.</p>
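<p>As a rough sketch of one direction such tuning could take, the attribute filter can be driven by a list of allowed attribute names instead of hard-coded patterns. This is not the code from this post: <em>KeepAttributes</em>, <em>OpenTagExp</em> and <em>AttributeExp</em> are hypothetical names, and the patterns only cover simple, well-formed opening tags.</p>
<pre class="brush: c#">
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class AttributeFilterSketch
{
	// Hypothetical pattern: an opening tag, capturing its name and its raw attribute text
	static readonly Regex OpenTagExp = new Regex(@&quot;&lt;([a-zA-Z][a-zA-Z0-9]*)((?:\s+[^&gt;]*)?)&gt;&quot;);

	// Hypothetical pattern: name=&quot;value&quot;, name=&#039;value&#039; or name=value
	static readonly Regex AttributeExp = new Regex(
		@&quot;\s+([a-zA-Z][\w:-]*)\s*=\s*(&quot;&quot;[^&quot;&quot;]*&quot;&quot;|&#039;[^&#039;]*&#039;|[^\s&quot;&quot;&#039;&gt;]+)&quot;);

	public static string KeepAttributes(string input, ICollection&lt;string&gt; allowedAttributes)
	{
		return OpenTagExp.Replace(input, tag =&gt;
		{
			// Rebuild the tag from the allowed attributes only; everything else is dropped
			StringBuilder kept = new StringBuilder();
			foreach (Match attr in AttributeExp.Matches(tag.Groups[2].Value))
			{
				string name = attr.Groups[1].Value;
				if (allowedAttributes.Contains(name.ToLowerInvariant()))
					kept.Append(&#039; &#039;).Append(name).Append(&#039;=&#039;).Append(attr.Groups[2].Value);
			}
			return &quot;&lt;&quot; + tag.Groups[1].Value + kept + &quot;&gt;&quot;;
		});
	}

	static void Main()
	{
		string test4 = &quot;&lt;a class=\&quot;classof69\&quot; onClick=&#039;crosssite.boom()&#039; href=&#039;http://www.dijksterhuis.org&#039;&gt;Martijn Dijksterhuis&lt;/a&gt;&quot;;
		Console.WriteLine(KeepAttributes(test4, new HashSet&lt;string&gt; { &quot;href&quot;, &quot;class&quot; }));
	}
}
</pre>
<p>With &#8220;href&#8221; and &#8220;class&#8221; allowed, the sketch prints the same cleaned link as shown above. It does not look inside attribute values, so an href that contains a javascript: URL would still pass and needs a separate check.</p>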
<p><strong>A word of caution</strong></p>
<p>Regular expressions are voodoo: very cool, but still voodoo. The above functions work for the tests I have applied to them, but your mileage may vary! If you have a special situation that doesn&#8217;t work, leave a note below and maybe we can work out the problems.</p>
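<p>To make that caution concrete, here is a small sketch of one input that trips up the tag pattern used by <em>StripTags</em> (the same expression as StripHTMLExp in the source code below). The class and method names are made up for the illustration.</p>
<pre class="brush: c#">
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagRegexPitfall
{
	// Returns every piece of the input that the tag pattern treats as a tag
	public static List&lt;string&gt; FindTags(string input)
	{
		Regex tagExp = new Regex(@&quot;(&lt;\/?[^&gt;]+&gt;)&quot;);
		List&lt;string&gt; tags = new List&lt;string&gt;();
		foreach (Match m in tagExp.Matches(input))
			tags.Add(m.Value);
		return tags;
	}

	static void Main()
	{
		// A &quot;&gt;&quot; inside an attribute value ends the match too early
		string input = &quot;&lt;a title=\&quot;a&gt;b\&quot; href=\&quot;#\&quot;&gt;link&lt;/a&gt;&quot;;
		foreach (string tag in FindTags(input))
			Console.WriteLine(tag);
		// Prints:
		// &lt;a title=&quot;a&gt;
		// &lt;/a&gt;
	}
}
</pre>
<p>The first match ends at the &gt; inside the title value, so the rest of the tag (b&quot; href=&quot;#&quot;&gt;) is treated as plain text and survives the cleanup.</p>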
<p><strong>Credits</strong></p>
<p>The StripTags function is of course inspired by the <a id="ixfp" title="PHP version" href="http://tw.php.net/manual/en/function.strip-tags.php">PHP version</a>, and by a JavaScript implementation thereof by <a id="evns" title="Kevin van Zonneveld" href="http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_strip_tags/">Kevin van Zonneveld</a>. The attribute stripping routine is based on the regular expressions by <a id="q100" title="mdw252" href="http://tw.php.net/manual/en/function.strip-tags.php#88491">mdw252</a> in one of the strip_tags manual page comments.</p>
<p><strong>Source code</strong></p>
<p>The complete source code for the <em>StripTags</em> and <em>StripTagsAndAttributes</em> functions, together with my test code, can be found below:</p>
<pre class="brush: c#">

using System;
using System.Text.RegularExpressions;

namespace StripHTML
{
	class MainClass
	{

		// Replaces only the first occurrence of needle in haystack
		private static string ReplaceFirst(string haystack, string needle, string replacement)
		{
			int pos = haystack.IndexOf(needle);
			if (pos &lt; 0) return haystack;
			return haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
		}

		// Replaces every occurrence of needle in haystack
		private static string ReplaceAll(string haystack, string needle, string replacement)
		{
			// Avoid a possible infinite loop
			if (string.IsNullOrEmpty(needle) || needle == replacement) return haystack;

			int pos;
			// IndexOf returns 0 for a match at the very start, so test for &gt;= 0
			while ((pos = haystack.IndexOf(needle)) &gt;= 0)
				haystack = haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
			return haystack;
		}

		public static string StripTags(string Input, string[] AllowedTags)
		{
			Regex StripHTMLExp = new Regex(@&quot;(&lt;\/?[^&gt;]+&gt;)&quot;);
		    string Output = Input;

			foreach(Match Tag in StripHTMLExp.Matches(Input))
			{
				string HTMLTag = Tag.Value.ToLower();
				bool IsAllowed = false;

				foreach(string AllowedTag in AllowedTags)
				{
					int offset = -1;

					// Determine if it is an allowed tag: it must start with
					// &quot;&lt;tag&gt;&quot;, &quot;&lt;tag &quot; or &quot;&lt;/tag&gt;&quot;
					if (offset!=0) offset = HTMLTag.IndexOf(&#039;&lt;&#039;+AllowedTag+&#039;&gt;&#039;);
					if (offset!=0) offset = HTMLTag.IndexOf(&#039;&lt;&#039;+AllowedTag+&#039; &#039;);
					// Include the closing &quot;&gt;&quot; so that allowing &quot;a&quot; does not also allow &quot;&lt;/abbr&gt;&quot;
					if (offset!=0) offset = HTMLTag.IndexOf(&quot;&lt;/&quot;+AllowedTag+&#039;&gt;&#039;);

					// If it matched any of the above the tag is allowed
					if (offset==0)
					{
					 	IsAllowed = true;
						break;
					}
				}

				// Remove tags that are not allowed
				if (!IsAllowed) Output = ReplaceFirst(Output,Tag.Value,&quot;&quot;);
			}

			return Output;
		}

		public static string StripTagsAndAttributes(string Input, string[] AllowedTags)
		{
			/* Remove all unwanted tags first */
			string Output = StripTags(Input,AllowedTags);

			/* Lambda functions */
			MatchEvaluator HrefMatch = m =&gt; m.Groups[1].Value + &quot;href..;,;..&quot; + m.Groups[2].Value;
			MatchEvaluator ClassMatch = m =&gt; m.Groups[1].Value + &quot;class..;,;..&quot; + m.Groups[2].Value;
			MatchEvaluator UnsafeMatch = m =&gt; m.Groups[1].Value + m.Groups[4].Value;

			/* Allow the &quot;href&quot; attribute */
			Output = new Regex(&quot;(&lt;a.*)href=(.*&gt;)&quot;).Replace(Output,HrefMatch);

			/* Allow the &quot;class&quot; attribute */
			Output = new Regex(&quot;(&lt;a.*)class=(.*&gt;)&quot;).Replace(Output,ClassMatch);

			/* Remove unsafe attributes in any of the remaining tags */
			Output = new Regex(@&quot;(&lt;.*) .*=(\&#039;|\&quot;&quot;|\w)[\w|.|(|)]*(\&#039;|\&quot;&quot;|\w)(.*&gt;)&quot;).Replace(Output,UnsafeMatch);

			/* Return the allowed tags to their proper form */
			Output = ReplaceAll(Output,&quot;..;,;..&quot;, &quot;=&quot;);

			return Output;
		}

		public static void Main(string[] args)
		{
			string test1 = StripTags(&quot;&lt;p&gt;George&lt;/p&gt;&lt;b&gt;W&lt;/b&gt;&lt;i&gt;Bush&lt;/i&gt;&quot;, new string[]{&quot;i&quot;,&quot;b&quot;});
			string test2 = StripTags(&quot;&lt;p&gt;George &lt;img src=&#039;someimage.png&#039; onmouseover=&#039;someFunction()&#039;&gt;W &lt;i&gt;Bush&lt;/i&gt;&lt;/p&gt;&quot;, new string[]{&quot;p&quot;});
			string test3 = StripTags(&quot;&lt;a href=&#039;http://www.dijksterhuis.org&#039;&gt;Martijn &lt;b&gt;Dijksterhuis&lt;/b&gt;&lt;/a&gt;&quot;, new string[]{&quot;a&quot;});

			Console.WriteLine(test1);
			Console.WriteLine(test2);
			Console.WriteLine(test3);

			string test4 = &quot;&lt;a class=\&quot;classof69\&quot; onClick=&#039;crosssite.boom()&#039; href=&#039;http://www.dijksterhuis.org&#039;&gt;Martijn Dijksterhuis&lt;/a&gt;&quot;;
			Console.WriteLine(StripTagsAndAttributes(test4, new string[]{&quot;a&quot;}));
		}
	}
}
</pre>
<p>Image credit: <a rel="nofollow" href="http://www.flickr.com/photos/jesper/">Jesper Rønn-Jensen</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dijksterhuis.org/safely-cleaning-html-with-strip_tags-in-csharp/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>
