regex iframe sanitization markdownsharp

Validating that an iframe src has a specific url with regex


I'm in the middle of integrating MarkdownSharp, a serverside Markdown compilation library. I have that working, but now I need to sanitize the generated html.

I took a look at the Stack Exchange Data Explorer source code to see how they sanitize their html, and see that they use the following regex to sanitize images post-conversion:

private static readonly Regex _whitelist_img =
        new Regex(
            @"
        ^<img\s
        src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
        (\swidth=""\d{1,3}"")?
        (\sheight=""\d{1,3}"")?
        (\salt=""[^""<>]*"")?
        (\stitle=""[^""<>]*"")?
        \s?/?>$",
            RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled |
            RegexOptions.IgnorePatternWhitespace);

I've been wrestling with how to write an analogous regex, whitelist_iframe, that ensures the iframe contains a link from YouTube or Vimeo. The following are examples of what I'd like to embed:

<iframe width="560" height="315" src="//www.youtube.com/embed/IZ_ScEebDOM?rel=0" frameborder="0" allowfullscreen></iframe>


<iframe src="//player.vimeo.com/video/80825843?title=0&amp;byline=0&amp;portrait=0" width="500" height="281" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>

So I believe the above needs to be modified to:

  1. Account for // instead of http or https
  2. Account for </iframe> closing tag
  3. Account for //www.youtube.com or //player.vimeo.com being required in the beginning of the src tag.

I'm in the middle of butchering this up as my first regex... any help with this would be much appreciated.

Note that I am not looking for a better overall approach that introduces additional libraries or complexity here. I just want to augment code that's already working, using regex.


Solution

  • Since you're a beginner with regexes, I can only warn you that this is a slippery slope. Simple constructs are easy to match, but regexes and HTML do not mix well. I know that it's been done, but you need to be something of an expert to know when it's a good idea and when it's not. Since you describe this as your first regex, I suggest you pick up a copy of Friedl's "Mastering Regular Expressions" and read at least the first few chapters before you begin using them. (That's what I did.)

    1. Account for // instead of http or https

      Remove the "https?:" from the existing regex:

      src=""//[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
      
    2. Account for </iframe> closing tag

      Add the closing tag at the end of the pattern, right after the end of the opening tag:

      \s?/?></iframe>$
      
    3. Account for //www.youtube.com or //player.vimeo.com being required in the beginning of the src tag.

      Add the desired domains as an alternation. Escape the dots so that each one matches a literal "." rather than any character:

      src=""//(www\.youtube\.com|player\.vimeo\.com)/[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
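
Putting the three changes together, a minimal sketch might look like the following. Several things in it are my assumptions and not part of the Data Explorer code: the names `IframeWhitelist`, `_whitelist_iframe` and `IsAllowed` are made up for illustration; the dots in the host names are escaped so they only match a literal "."; `RegexOptions.IgnoreCase` is added because the YouTube id in your example contains uppercase letters, which `[-a-z0-9...]` would otherwise reject; `width` and `height` are accepted on either side of `src`, since your two examples order them differently; and `frameborder` plus the three fullscreen attributes from your examples are allowed. It also assumes the whole `<iframe ...></iframe>` element is handed to the regex as one string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IframeWhitelist
{
    // Hypothetical counterpart to _whitelist_img from the question.
    // ExplicitCapture keeps the unnamed groups non-capturing, as in the original.
    private static readonly Regex _whitelist_iframe = new Regex(
        @"
        ^<iframe
        (\s(width|height)=""\d{1,3}"")*
        \ssrc=""//(www\.youtube\.com|player\.vimeo\.com)/[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
        (\s(width|height|frameborder)=""\d{1,3}"")*
        (\s(webkitallowfullscreen|mozallowfullscreen|allowfullscreen))*
        \s?></iframe>$",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled |
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Returns true only when the complete element matches the whitelist.
    public static bool IsAllowed(string html)
    {
        return _whitelist_iframe.IsMatch(html);
    }

    public static void Main()
    {
        // The two embeds from the question.
        string youtube = @"<iframe width=""560"" height=""315"" src=""//www.youtube.com/embed/IZ_ScEebDOM?rel=0"" frameborder=""0"" allowfullscreen></iframe>";
        string vimeo = @"<iframe src=""//player.vimeo.com/video/80825843?title=0&amp;byline=0&amp;portrait=0"" width=""500"" height=""281"" frameborder=""0"" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>";

        // Made-up inputs that should be rejected: another host, and a host
        // that would only match if the dot were left unescaped.
        string otherHost = @"<iframe src=""//evil.example.com/embed/x""></iframe>";
        string fakeDot = @"<iframe src=""//wwwXyoutube.com/embed/x""></iframe>";

        Console.WriteLine(IsAllowed(youtube));   // True
        Console.WriteLine(IsAllowed(vimeo));     // True
        Console.WriteLine(IsAllowed(otherHost)); // False
        Console.WriteLine(IsAllowed(fakeDot));   // False
    }
}
```

Any attribute not listed, or any ordering the pattern does not anticipate, is rejected. That is the point of a whitelist, but it does mean you may need to extend the pattern if the embed code from either site changes.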