asp.net html regex string html-content-extraction

Extracting a text fragment from an HTML body (in .NET)


I have HTML content that is entered by the user via a rich-text editor, so it can be almost anything (except elements that belong outside the body tag; no worries about "head", the doctype, etc.). An example of this content:

<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />

The trick is that I need to extract only the first 100 characters of the text (HTML tags stripped). I also need to retain the line breaks and not break any word.

So the output for the above will be something like:

Header 1
Some text here

Some more text here

A link here

Header 2
Some text here

Some

It has 98 characters and the line breaks are retained. What I have achieved so far is stripping all the HTML tags using Regex:

Regex.Replace(htmlStr, "<[^>]*>", "")

Then trim the length using Regex as well with:

Regex.Match(textStr, @"^.{1,100}\b").Value

My problem is how to retain the line breaks. I get output like:

Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text

Notice the joined sentences? Perhaps someone can show me some other way of solving this problem. Thanks!

Additional info: my purpose is to generate a plain-text synopsis from a bunch of HTML content. I guess this will help clarify the problem.


Solution

  • Well, I need to close this even though I don't have the ideal solution. Since the HTML tags used in my app are very common ones (no tables, lists, etc.) with little or no nesting, what I did was preformat the HTML fragments before saving them after user input.

    Before extracting them for display as plain text, I use regex to remove the HTML tags while retaining the line breaks. Hardly rocket science, but it works for me.
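
    The answer above does not include its code, so what follows is only a minimal sketch of one way the "strip tags but keep line breaks" idea could look, not the poster's actual implementation. The class and method names (HtmlSynopsis, ToPlainText, Truncate, Extract) are hypothetical, and the set of tags treated as line breaks is an assumption that suits simple, shallowly nested markup like the example in the question. A regex-based approach like this is not a general HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; names and the list of "block" tags are assumptions.
public static class HtmlSynopsis
{
    // Closing tags of common block elements, plus <br> and <hr>, become line breaks.
    private static readonly Regex BlockBreak = new Regex(
        @"</(p|div|h[1-6]|li|blockquote)\s*>|<(br|hr)\b[^>]*>",
        RegexOptions.IgnoreCase);

    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    public static string ToPlainText(string html)
    {
        // Whitespace in the HTML source (including newlines) is not significant.
        string text = Regex.Replace(html, @"\s+", " ");
        // Mark the end of each block element before the tags are removed.
        text = BlockBreak.Replace(text, "\n");
        // Remove all remaining tags, then decode entities such as &amp;.
        text = AnyTag.Replace(text, "");
        text = WebUtility.HtmlDecode(text);
        // Drop whitespace around line breaks and collapse consecutive breaks into one.
        text = Regex.Replace(text, @"\s*\n\s*", "\n");
        return text.Trim();
    }

    public static string Truncate(string text, int maxLength)
    {
        if (text.Length <= maxLength) return text;
        // Search backwards from maxLength for the last space or line break,
        // so the cut never falls inside a word.
        int cut = text.LastIndexOfAny(new[] { ' ', '\n' }, maxLength);
        // No whitespace found: a single very long word, so fall back to a hard cut.
        if (cut <= 0) return text.Substring(0, maxLength);
        return text.Substring(0, cut).TrimEnd();
    }

    public static string Extract(string html, int maxLength)
    {
        return Truncate(ToPlainText(html), maxLength);
    }
}
```

    With this sketch, HtmlSynopsis.Extract(htmlStr, 100) would return at most 100 characters ending on a whole word. Two points differ from the snippets in the question: consecutive line breaks are collapsed into a single one rather than kept as blank lines, and the truncation uses LastIndexOfAny instead of ^.{1,100}\b, because without RegexOptions.Singleline the "." in that pattern does not match "\n" and the match would stop at the end of the first line.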