javascript php html regex htmlpurifier

Removing BBCode URL tag from string


I'm trying to build a stable system that lets users paste any mixture of BBCode and HTML into an input, which I will then sanitize and strip as I want.

The content is copied from forums, and the issue is that they all seem to use different code. Some display more than one,
some use a self-closing br tag, others use [URL=...], and others just use [URL]URL[/URL], etc.

So far, I use HTMLPurifier to strip everything except img tags.

HTMLPurifier doesn't (as far as I can see) remove BBCode. So, given a string like this:

[URL=http://awebsite.com]My Link [IMG]imagelink.png[/IMG][/URL]

How can I remove the URL tags and leave just the IMG tags?

I want to remove every part of the URL tag: the tags themselves, the URL given, and the link text as well, which may prove difficult.

So far I have got quite far by converting [IMG] tags etc. using regex, which works, but I feel there are too many variants to hardcode this.

Any suggestions on a more efficient way, or any workable way, to remove the URL tags?


Solution

  • Option 1: Removing only the tags

    If you just want to remove tags such as [URL=http://awebsite.com] and [/URL], leaving the content inside, the regex is simple:

    Search: \[/?URL[^\]]*\]

    Replace: Empty string

    In JavaScript

    replaced = string.replace(/\[\/?URL[^\]]*\]/g, "");
    

    In PHP

    $replaced = preg_replace('%\[/?URL[^\]]*\]%', '', $str);
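
As a quick check, here is a minimal sketch that applies the Option 1 pattern to the sample string from the question. The function name `stripUrlTags` is made up for illustration, and the `i` flag is an addition that is not in the pattern above; it is there on the assumption that some forums emit lowercase `[url]` tags.

```javascript
// Sample string taken from the question.
const sample = "[URL=http://awebsite.com]My Link [IMG]imagelink.png[/IMG][/URL]";

// Hypothetical helper: removes opening and closing URL tags,
// keeping whatever sits between them.
// g = replace every match, i = also match [url], [Url], etc. (assumed variant).
function stripUrlTags(text) {
  return text.replace(/\[\/?URL[^\]]*\]/gi, "");
}

console.log(stripUrlTags(sample));
// → My Link [IMG]imagelink.png[/IMG]

console.log(stripUrlTags("[url]http://awebsite.com[/url]"));
// → http://awebsite.com
```

Note that the link text and any bare URL between the tags survive, which is why Option 2 exists.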
    

  • Option 2: Also removing content such as "My Link"

    Here, we also remove the text that follows [URL...] up to the next tag, so nested tags such as [IMG] are kept.

    Search: \[URL[^\]]*\][^\[\]]*|\[/URL[^\]]*\]

    Replace: Empty string

    JavaScript:

    replaced = string.replace(/\[URL[^\]]*\][^\[\]]*|\[\/URL[^\]]*\]/g, "");
    

    PHP:

    $replaced = preg_replace('%\[URL[^\]]*\][^\[\]]*|\[/URL[^\]]*\]%', '', $str);
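
To see what Option 2 does and does not cover, here is a small sketch run against a few of the variants mentioned in the question. The helper name `stripUrlTagsAndText` is hypothetical, and the `i` flag is again an added assumption for lowercase tags.

```javascript
// Hypothetical helper: removes [URL...] plus the plain text right after it
// (up to the next bracket), and removes the closing [/URL] tag.
function stripUrlTagsAndText(text) {
  return text.replace(/\[URL[^\]]*\][^\[\]]*|\[\/URL[^\]]*\]/gi, "");
}

// Sample from the question: link text goes, the image stays.
console.log(stripUrlTagsAndText(
  "[URL=http://awebsite.com]My Link [IMG]imagelink.png[/IMG][/URL]"
));
// → [IMG]imagelink.png[/IMG]

// [URL]URL[/URL] form: everything is removed, leaving an empty string.
console.log(stripUrlTagsAndText("[url]http://awebsite.com[/url]") === "");
// → true

// Limitation: text that comes after a nested tag is not removed.
console.log(stripUrlTagsAndText("[URL=x][IMG]a.png[/IMG] caption[/URL]"));
// → [IMG]a.png[/IMG] caption
```

The last case shows that only the text immediately following the opening tag is dropped. If link text can also appear after the image, a second pass or a proper BBCode parser would probably be needed.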