Strip HTML Tags from a String using Regular Expressions 

Whilst writing some code to search forum posts, I needed to remove all HTML tags from the posts before I could do a valid phrase search. I created the following function:

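// Requires: using System.Text.RegularExpressions; using System.Web;
public static string StripDecodeHtml(string content)
{
  // turn encoded entities (e.g. &lt;p&gt;) back into literal characters
  content = HttpUtility.HtmlDecode(content);
  // opening tag, lazily captured inner text, then the matching closing tag (\1 = tag name)
  string myTagPattern = @"<(\w+)\b[^>]*>(?<text>.*?)</\1>";
  Regex myTagRegex = new Regex(myTagPattern, RegexOptions.Compiled
          |RegexOptions.IgnoreCase|RegexOptions.Singleline);
  // do until no more tags to match
  while (myTagRegex.IsMatch(content))
    content = myTagRegex.Replace(content, @"${text}");
  // remove self closing tags completely
  Regex mySelfCloseRegex = new Regex(@"<.*?/>", RegexOptions.Compiled|RegexOptions.IgnoreCase);
  // Replace returns a new string, so the result must be assigned back
  content = mySelfCloseRegex.Replace(content, string.Empty);

  return content;
}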

First, the input string is HTML-decoded so that any encoded entities become literal characters. Next, a Regex object is set up to match an opening HTML tag together with its closing tag. The text between the two tags is captured in a named group called “text”, which the replacement then uses: each match (tags plus text) is replaced with just the text, i.e. the tags are stripped out. The replace is repeated until no match is found, since tags can be nested within tags, e.g.

<body><div>A div embedded within the body</div></body>.
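
To illustrate why the replace has to loop, here is a minimal sketch that counts the passes needed for the nested example. It assumes the paired-tag pattern is `<(\w+)\b[^>]*>(?<text>.*?)</\1>`; the class name `NestedTagDemo` and the method `StripPairedTags` are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Assumed pattern: opening tag, lazily captured inner text, and a closing
    // tag whose name must equal the captured opening tag name (\1).
    private static readonly Regex TagRegex = new Regex(
        @"<(\w+)\b[^>]*>(?<text>.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Strips paired tags and reports how many replace passes were needed.
    public static string StripPairedTags(string content, out int passes)
    {
        passes = 0;
        // Matches found in one pass do not overlap, so a tag nested inside
        // another match is only uncovered (and stripped) on a later pass.
        while (TagRegex.IsMatch(content))
        {
            content = TagRegex.Replace(content, "${text}");
            passes++;
        }
        return content;
    }
}
```

Calling `NestedTagDemo.StripPairedTags("<body><div>A div embedded within the body</div></body>", out passes)` returns the plain sentence with `passes` equal to 2: the first pass removes the body tags and the second removes the div tags.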

A second Regex object is created to match self-closing tags, e.g. <img src="../folder/image.gif" />. Each self-closing tag is simply replaced by an empty string, and the resulting string is returned.
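
The self-closing step can be sketched in isolation as follows. The class name `SelfClosingTagDemo` and the method `RemoveSelfClosingTags` are hypothetical; the pattern `<.*?/>` is the one used in the function.

```csharp
using System.Text.RegularExpressions;

public static class SelfClosingTagDemo
{
    // Lazily matches from a '<' up to the first following "/>".
    private static readonly Regex SelfCloseRegex =
        new Regex(@"<.*?/>", RegexOptions.IgnoreCase);

    public static string RemoveSelfClosingTags(string content)
    {
        // Regex.Replace returns a new string and leaves its input unchanged,
        // so the result has to be captured (here, returned directly).
        return SelfCloseRegex.Replace(content, string.Empty);
    }
}
```

For example, `SelfClosingTagDemo.RemoveSelfClosingTags("Before <img src=\"../folder/image.gif\" /> after")` removes the whole img tag and leaves the surrounding text, including the spaces on either side of the tag. Because the match starts at the first `<` it finds, a stray `<` in ordinary text earlier on the same line would be swallowed as well; a stricter pattern such as `<[^>]*/>` is one possible alternative.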
