In our enterprise software we allow customers to supply their own HTML to customize the ‘Contact’ and ‘Legal’ pages of the app. Its a nice feature, but as our app knows nothing specific about the app which actually provides the HTML, I was wondering how I would approach such a problem. I have read some blog articles, SO posts and watched some videos but those only explain the danger of HTML injection or how to do it with createElement
or innerHTML
or other direct approaches.
I am looking for the most safe approach to displaying HTML I have no direct control over. Any article or library would be greatly appreciated.
You should sanitize the HTML code entered by your users.
One way to do it is by defining a whitelist: you define a list of tags (and for each one of them you can also define a list of allowed attributes) which are allowed to be in the output HTML. Every other tag which is not explicitly allowed will be removed by the sanitization.
There are plenty of whitelist already available out there. You can use the one for the TinyMCE HTML editor for example (see here):
<?xml version="1.0" encoding="UTF-8" ?>
<!--
TinyMCE
-->
<anti-samy-rules xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="antisamy.xsd">
<directives>
<directive name="omitXmlDeclaration" value="true" />
<directive name="omitDoctypeDeclaration" value="false" />
<directive name="maxInputSize" value="100000" />
<directive name="embedStyleSheets" value="false" />
<directive name="useXHTML" value="true" />
<directive name="formatOutput" value="true" />
</directives>
<common-regexps>
<!--
From W3C:
This attribute assigns a class name or set of class names to an
element. Any number of elements may be assigned the same class
name or names. Multiple class names must be separated by white
space characters.
-->
<regexp name="htmlTitle" value="[a-zA-Z0-9\s\-_',:\[\]!\./\\\(\)&]*" />
<!-- force non-empty with a '+' at the end instead of '*'
-->
<regexp name="onsiteURL" value="([\p{L}\p{N}\p{Zs}/\.\?=&\-~])+" />
<!-- ([\w\\/\.\?=&;\#-~]+|\#(\w)+)
-->
<!-- ([\p{L}/ 0-9&\#-.?=])*
-->
<regexp name="offsiteURL" value="(\s)*((ht|f)tp(s?)://|mailto:)[A-Za-z0-9]+[~a-zA-Z0-9-_\.@\#\$%&;:,\?=/\+!\(\)]*(\s)*" />
</common-regexps>
<!--
Tag.name = a, b, div, body, etc.
Tag.action = filter: remove tags, but keep content, validate: keep content as long as it passes rules, remove: remove tag and contents
Attribute.name = id, class, href, align, width, etc.
Attribute.onInvalid = what to do when the attribute is invalid, e.g., remove the tag (removeTag), remove the attribute (removeAttribute), filter the tag (filterTag)
Attribute.description = What rules in English you want to tell the users they can have for this attribute. Include helpful things so they'll be able to tune their HTML
-->
<!--
Some attributes are common to all (or most) HTML tags. There aren't many that qualify for this. You have to make sure there's no
collisions between any of these attribute names with attribute names of other tags that are for different purposes.
-->
<common-attributes>
<attribute name="lang"
description="The 'lang' attribute tells the browser what language the element's attribute values and content are written in">
<regexp-list>
<regexp value="[a-zA-Z]{2,20}" />
</regexp-list>
</attribute>
<attribute name="title"
description="The 'title' attribute provides text that shows up in a 'tooltip' when a user hovers their mouse over the element">
<regexp-list>
<regexp name="htmlTitle" />
</regexp-list>
</attribute>
<attribute name="href" onInvalid="filterTag">
<regexp-list>
<regexp name="onsiteURL" />
<regexp name="offsiteURL" />
<!--
-->
</regexp-list>
</attribute>
<attribute name="align"
description="The 'align' attribute of an HTML element is a direction word, like 'left', 'right' or 'center'">
<literal-list>
<literal value="center" />
<literal value="left" />
<literal value="right" />
<literal value="justify" />
<literal value="char" />
</literal-list>
</attribute>
<attribute name="style"
description="The 'style' attribute provides the ability for users to change many attributes of the tag's contents using a strict syntax" />
</common-attributes>
<!--
This requires normal updates as browsers continue to diverge from the W3C and each other. As long as the browser wars continue
this is going to continue. I'm not sure war is the right word for what's going on. Doesn't somebody have to win a war after
a while?
-->
<global-tag-attributes>
<attribute name="title" />
<attribute name="lang" />
<attribute name="style" />
</global-tag-attributes>
<tags-to-encode>
<tag>g</tag>
<tag>grin</tag>
</tags-to-encode>
<tag-rules>
<!-- Remove -->
<tag name="script" action="remove" />
<tag name="noscript" action="remove" />
<tag name="iframe" action="remove" />
<tag name="frameset" action="remove" />
<tag name="frame" action="remove" />
<tag name="noframes" action="remove" />
<tag name="head" action="remove" />
<tag name="title" action="remove" />
<tag name="base" action="remove" />
<tag name="style" action="remove" />
<tag name="link" action="remove" />
<tag name="input" action="remove" />
<tag name="textarea" action="remove" />
<!-- Truncate -->
<tag name="br" action="truncate" />
<!-- Validate -->
<tag name="p" action="validate">
<attribute name="align" />
</tag>
<tag name="div" action="validate" />
<tag name="span" action="validate" />
<tag name="i" action="validate" />
<tag name="b" action="validate" />
<tag name="strong" action="validate" />
<tag name="s" action="validate" />
<tag name="strike" action="validate" />
<tag name="u" action="validate" />
<tag name="em" action="validate" />
<tag name="blockquote" action="validate" />
<tag name="tt" action="truncate" />
<tag name="a" action="validate">
<attribute name="href" onInvalid="filterTag" />
<attribute name="nohref">
<literal-list>
<literal value="nohref" />
<literal value="" />
</literal-list>
</attribute>
<attribute name="rel">
<literal-list>
<literal value="nofollow" />
</literal-list>
</attribute>
</tag>
<!-- List tags
-->
<tag name="ul" action="validate" />
<tag name="ol" action="validate" />
<tag name="li" action="validate" />
<tag name="dl" action="validate" />
<tag name="dt" action="validate" />
<tag name="dd" action="validate" />
</tag-rules>
<css-rules>
<property name="text-decoration" default="none"
description="">
<category-list>
<category value="visual" />
</category-list>
<literal-list>
<literal value="underline" />
<literal value="overline" />
<literal value="line-through" />
</literal-list>
</property>
</css-rules>
</anti-samy-rules>
This policy is designed to sanitize the HTML entered in an HTML editor, so it may fit your needs. You can also find more policies here.
If you use Java, you can implement that policy through the java-html-sanitizer, or you can also define a custom one.