I want to replace square-braced image placeholders with valid HTML markup.
A sample placeholder might look like this:
[img:http://example.com/_data/025_img.jpg]
What I want is to change the [img: ... ] part into an <img> tag and get a result like this:
<img src='http://example.com/_data/025_img.jpg' border='0' />
Additional information about user-uploaded images relevant to this task:
The placeholder format is [img: ... ], where ... is the link that is copied when the user clicks an image listed in their gallery.
I need to take each [img: ... ] tag, exchange it for an <img> tag, and render the post with images followed by text.
So the actual input from the user will be something along the lines of:
The brown fox jumped over foo bar [img:http://example.com/_data/025_img.jpg] and then went to bed [img:http://example.com/_data/0277_img.jpg] while thinking about [img:http://example.com/_data/1115_img.jpg]
That is the reason I asked for preg_replace() rather than preg_match(). preg_match() doesn't make the text follow the images.
Let's get the easy thing out of the way first.
/\[img:([^\]]+)\]/
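That is: the literal string [img:, followed by one or more characters that are not ] (captured as group 1), followed by a literal ].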
Run this through preg_match and element 1 in the match array will very likely be an image URL that you can easily insert into an img tag.
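As a minimal sketch of that extraction step (the variable names here are illustrative, not from the question), preg_match puts the captured text into index 1 of the match array. Note that preg_match stops at the first placeholder it finds:

```php
<?php
// Pattern from above: literal "[img:", a captured run of non-"]" characters, literal "]".
$pattern = '/\[img:([^\]]+)\]/';
$input = 'The brown fox jumped over foo bar [img:http://example.com/_data/025_img.jpg] and then went to bed';

if (preg_match($pattern, $input, $matches) === 1) {
    // $matches[0] is the whole placeholder, $matches[1] is the captured text.
    $candidateUrl = $matches[1];
} else {
    // No placeholder found in the input.
    $candidateUrl = null;
}
```

At this point $candidateUrl is only a candidate. Nothing has checked that it is actually a URL.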
But you shouldn't. Not right away.
First, this is insecure as heck. What's going to happen when I write this?
[img:javascript:alert(document.cookie);]
Uh-oh. That's not going to be good.
You're probably going to want to make sure that the thing that the user claims is a URL really is a URL. You can try doing this by calling parse_url. It will give you back an array of URL components. Make sure that the result has a host and a path, and that the scheme is http or https.
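A rough sketch of that check might look like the following. The function name looksLikeHttpUrl is a hypothetical helper, not something PHP provides, and this is only the first line of defense:

```php
<?php
// Hypothetical helper: true only for http(s) URLs that have both a host and a path.
function looksLikeHttpUrl(string $candidate): bool
{
    $parts = parse_url($candidate);
    if ($parts === false) {
        // parse_url returns false for seriously malformed input.
        return false;
    }

    $scheme = strtolower($parts['scheme'] ?? '');

    return in_array($scheme, ['http', 'https'], true)
        && isset($parts['host'], $parts['path']);
}
```

With this, the javascript: example above is rejected because its scheme is neither http nor https, while the sample gallery URL from the question passes.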
Okay, but what happens when the user enters this?
[img:http://www.example.com/foo.jpg" onmouseover="alert(document.cookie)"]
That's a valid...ish... URL that will be successfully deconstructed by parse_url and may well pass basic checks for well-formedness. Filtering out spaces and quotes (single and double) will be a good starting point, but there are still more things to worry about.
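One way to sketch that filter is a small character-class test. The helper name hasSuspiciousCharacters is made up for illustration, and rejecting the input outright (rather than stripping characters) is a design choice of this sketch:

```php
<?php
// Hypothetical helper: true if the candidate contains whitespace or a quote character.
function hasSuspiciousCharacters(string $candidate): bool
{
    // \s covers spaces, tabs and newlines; the class also lists " and '.
    return preg_match('/[\s"\']/', $candidate) === 1;
}
```

Rejecting is simpler to reason about than stripping, since a stripped URL is no longer the URL the user supplied.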
The bottom line is that markup like this is a vector for XSS (cross-site scripting) vulnerabilities.
You can probably mitigate some of the threat by passing the URL through htmlspecialchars. That will at least nuke quotes and angle brackets (pass ENT_QUOTES so that single quotes are escaped too), and it's hard to be nasty with those taken care of. Just watch out for character set silliness: some non-UTF-8 character encodings can include byte sequences that end up being read as ASCII quotes...
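Since the question asked for a replacement rather than a match, here is a hedged sketch that combines the checks discussed so far with preg_replace_callback. The function name renderImagePlaceholders is hypothetical, and the sketch assumes UTF-8 input. It only deals with the placeholders; the rest of the post still needs its own escaping before it is output.

```php
<?php
// Hypothetical helper: swaps [img:...] placeholders for <img> tags and leaves
// anything that fails the checks in place as escaped plain text.
function renderImagePlaceholders(string $text): string
{
    $result = preg_replace_callback(
        '/\[img:([^\]]+)\]/',
        function (array $m): string {
            $url = $m[1];
            $parts = parse_url($url);
            $scheme = is_array($parts) ? strtolower($parts['scheme'] ?? '') : '';

            // Accept only http(s) URLs with a host and a path, and no spaces or quotes.
            $ok = in_array($scheme, ['http', 'https'], true)
                && isset($parts['host'], $parts['path'])
                && preg_match('/[\s"\']/', $url) !== 1;

            // ENT_QUOTES escapes both quote styles; the charset is stated explicitly.
            $escaped = htmlspecialchars($ok ? $url : $m[0], ENT_QUOTES, 'UTF-8');

            return $ok ? "<img src='" . $escaped . "' border='0' />" : $escaped;
        },
        $text
    );

    // preg_replace_callback returns null on a regex engine error.
    return $result ?? '';
}
```

Called on the sample input from the question, each of the three placeholders becomes an <img> tag in place, so the surrounding text keeps its position relative to the images.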
You probably want to use a real markup language for this (even if it's just Markdown), and you probably want to run the result through a whitelist-based HTML filter like HTML Purifier. This will help protect you from some levels of insanity.
Remember, you're only paranoid if they aren't out to get you. The web is full of people that are so stupid that they're malicious, and people that are so malicious that it's stupid.