I know this topic has been discussed quite extensively, as I've gone through and read more than 15 posts on the subject, but still can't find an answer to my question.
I'm looking for a function to sanitize data from a form. As absolutely NO HTML will be acceptable, how do I go about escaping ALL html entities so the user can absolutely not inject anything? I don't need a white list, as no input HTML is allowed.
Also, there's no need to run mysql_real_escape_string, as I don't use a MySQL database. I use MongoDB. I'm just storing first name, last name, phone numbers, basic stuff. No HTML. But I still don't want a user to be able to enter <script>whatever</script> as their first name and have the browser parse it when it's displayed back to them.
I thought about HTML Purifier and htmLawed, but they seem to be too much for what I'm trying to do. Do I just build a fancy preg_replace function?
There is no universal "make it safe" filter. Strings are only dangerous when placed into a specific context.
For example, if the context is a plain text document, you don't really have any worries.
htmlspecialchars is enough if the context is a text node (not within angle brackets). Specify the correct charset/encoding, which should be the charset/encoding declared in the HTTP headers sent by your server.
This is OK:
<p><?= htmlspecialchars($input, ENT_QUOTES, 'UTF-8'); ?></p>
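As a rough sketch of the text-node case, the call can be wrapped in a small helper so that the flags and charset are always passed the same way. The function name `escape_html_text` and the sample input are made up for illustration, and the added `ENT_SUBSTITUTE` flag is my assumption, not something the approach above requires:

```php
<?php
// Hypothetical helper (the name is an assumption): escape a value
// that will be printed inside an HTML text node.
function escape_html_text(string $value): string
{
    // ENT_QUOTES encodes both double and single quotes.
    // ENT_SUBSTITUTE replaces invalid byte sequences instead of
    // making the whole call return an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Sample hostile "first name" submitted through the form.
$firstName = '<script>alert("x")</script>';

echo '<p>' . escape_html_text($firstName) . '</p>', "\n";
// <p>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>
```

The browser then shows the markup as literal text instead of parsing it, which is all that is needed when the value only ever lands between tags.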
Things change if you need to output inside angle brackets, where the context is something like an HTML attribute, for example:
<p <?= htmlspecialchars($input, ENT_QUOTES, 'UTF-8'); ?> ></p>
or
<p title="<?= htmlspecialchars($input, ENT_QUOTES, 'UTF-8'); ?>" ></p>
In such contexts, the "make it safe" task becomes, in many cases, extremely difficult (legacy browsers have some absolutely bewildering bugs that defy the common expectations of software developers). You would be foolish not to stand on the shoulders of giants and make use of something like HTML Purifier.
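To make the context problem concrete, here is a small sketch of the first attribute example above, where the value is printed directly inside the tag. The input string is a made-up attacker value. Because it contains none of the characters htmlspecialchars encodes (`&`, `<`, `>`, `"`, `'`), it passes through unchanged:

```php
<?php
// Hypothetical attacker input: no &, <, >, " or ' in it.
$input = 'onmouseover=alert(1)';

$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Nothing was encoded, so the "escaped" value equals the raw input.
var_dump($escaped === $input); // bool(true)

// Printed inside the tag, the value becomes a live event handler attribute.
echo '<p ' . $escaped . ' ></p>', "\n";
// <p onmouseover=alert(1) ></p>
```

The escaping function did exactly what it is documented to do. The string was simply placed in a context it was never meant to protect.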