I'm using the PageDown editor. The code I use to generate the preview is the following:
$(document).ready(function () {
var previewConverter = Markdown.getSanitizingConverter();
var editor = new Markdown.Editor(previewConverter);
editor.run();
});
When I enter some text into the input, the dynamically generated preview is rendered as expected. The content (the raw entered text shown below) is then saved to the database:
"http://www.google.com\n\n<script>alert('hi');</script>\n\n[google][4]\n\n\n [1]: http://www.google.com"
On the server side, before the page is rendered, I convert the text fetched from the database using the MarkdownSharp library, v1.13.0.0. After conversion, I sanitize the HTML using Jeff Atwood's code:
// requires: using System.Text.RegularExpressions;
private static Regex _tags = new Regex("<[^>]*(>|$)",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
private static Regex _whitelist = new Regex(@"
    ^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$|
    ^<(b|h)r\s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
private static Regex _whitelist_a = new Regex(@"
    ^<a\s
    href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""
    (\stitle=""[^""<>]+"")?\s?>$|
    ^</a>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
private static Regex _whitelist_img = new Regex(@"
    ^<img\s
    src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
    (\swidth=""\d{1,3}"")?
    (\sheight=""\d{1,3}"")?
    (\salt=""[^""<>]*"")?
    (\stitle=""[^""<>]*"")?
    \s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using
/// a whitelist based approach, leaving the "safe" HTML tags
/// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937
/// </summary>
public static string Sanitize(string html)
{
    if (String.IsNullOrEmpty(html)) return html;
    string tagname;
    Match tag;
    // match every HTML tag in the input
    MatchCollection tags = _tags.Matches(html);
    // walk backwards so that removing a tag does not shift the indexes of earlier matches
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLowerInvariant();
        if (!(_whitelist.IsMatch(tagname) || _whitelist_a.IsMatch(tagname) || _whitelist_img.IsMatch(tagname)))
        {
            html = html.Remove(tag.Index, tag.Length);
            System.Diagnostics.Debug.WriteLine("tag sanitized: " + tagname);
        }
    }
    return html;
}
The conversion and sanitization process is the following:
var md = new MarkdownSharp.Markdown();
var unsafeHtml = md.Transform(content);
var safeHtml = Sanitize(unsafeHtml);
return new HtmlString(safeHtml);
unsafeHtml contains:
"<p>http://www.google.com</p>\n\n<script>alert('hi');</script>\n\n<p><a href=\"http://www.google.com\">google</a></p>\n"
safeHtml contains:
"<p>http://www.google.com</p>\n\nalert('hi');\n\n<p><a href=\"http://www.google.com\">google</a></p>\n"
After rendering, the script tag is sanitized and the second link is converted as expected. Unfortunately, the first link is no longer a link, just plain text. How can I fix this?
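To illustrate why the first URL stays plain text, here is a simplified JavaScript sketch of the same tag-stripping idea. It is not a port of the full C# sanitizer: the whitelist is deliberately reduced to the tags that occur in this example (an assumption made for brevity). The point is that the sanitizer only removes tags. A bare URL inside a paragraph is not a tag, so it passes through unchanged and nothing ever turns it into a link.

```javascript
// Simplified sketch of the tag-stripping idea, for illustration only.
// Assumption: the whitelist is reduced to <p>, </p>, </a> and
// <a href="http(s)://...">; the real C# whitelist allows more tags.
const TAGS = /<[^>]*(>|$)/g;
const WHITELIST =
  /^<\/?p>$|^<\/a>$|^<a\shref="https?:\/\/[-a-z0-9+&@#\/%?=~_|!:,.;()]+"\s?>$/;

function sanitize(html) {
  // Drop every tag that is not whitelisted; text between tags is left untouched.
  return html.replace(TAGS, (tag) =>
    WHITELIST.test(tag.toLowerCase()) ? tag : ""
  );
}

const unsafeHtml =
  "<p>http://www.google.com</p>\n\n<script>alert('hi');</script>\n\n" +
  '<p><a href="http://www.google.com">google</a></p>\n';

// The <script> tags are removed, but the bare URL in the first paragraph
// is plain text both before and after sanitization.
console.log(sanitize(unsafeHtml));
```

So the missing link has to be produced during the Markdown conversion (or in a separate step), not by the sanitizer.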
Maybe a better approach would be to skip the server-side conversion and just use JavaScript to render the Markdown text on the page?
In Markdown.Converter.js we can find the _DoAutoLinks(text) function. It contains a section which automatically adds < and > around unadorned raw hyperlinks, and then autolinks anything like <http://example.com>. This is why http://www.google.com is first converted to:
<http://www.google.com>
and then to:
<a href="http://www.google.com">http://www.google.com</a>
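The two-step conversion can be sketched roughly as follows. This is only an illustration of the idea: the function names and the regular expressions are my own simplified assumptions, not the actual patterns used in Markdown.Converter.js.

```javascript
// Illustrative sketch only; the regexes are simplified assumptions,
// not the patterns PageDown actually uses.

// Step 1: wrap bare URLs in angle brackets.
// A URL counts as "bare" here when it starts the text or follows whitespace.
function wrapBareUrls(text) {
  return text.replace(/(^|\s)((?:https?|ftp):\/\/[^\s<>"]+)/g, "$1<$2>");
}

// Step 2: turn every <url> into an anchor element.
function linkBracketedUrls(text) {
  return text.replace(
    /<((?:https?|ftp):\/\/[^\s<>"]+)>/g,
    '<a href="$1">$1</a>'
  );
}

function autoLink(text) {
  return linkBracketedUrls(wrapBareUrls(text));
}

console.log(autoLink("http://www.google.com"));
// prints: <a href="http://www.google.com">http://www.google.com</a>
```

A URL that is already part of an inline Markdown link, such as [google](http://www.google.com), is left alone in this sketch because it follows "(" rather than whitespace.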
My temporary workaround is to do something similar on the C# side:
var unsafeHtml = DoAutolinks(md.Transform(content));
private static string DoAutolinks(string content)
{
    /* url pattern - from msdn.microsoft.com/en-us/library/ff650303.aspx;
       the port group there is (:(0-9)*)*, which matches the literal text "0-9",
       so it is changed to (:[0-9]+)? here */
    const string url = @"(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:[0-9]+)?(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?";
    // only a paragraph that consists of nothing but a bare URL is rewritten
    const string pattern = @"<p>(?<url>" + url + ")</p>";
    var result = Regex.Replace(content, pattern, "<p><a href=\"${url}\">${url}</a></p>");
    return result;
}
Should this functionality, the conversion of unadorned links, be included in MarkdownSharp?