I have a few HTML forms whose fields I am filtering on the server side (using Java Servlets), and I was wondering what I should allow, or perhaps what I should disallow. For e-mail addresses I remove anything that matches this:
[^A-Za-z0-9._%@-]
What are some similar rules I could apply to name, message and phone number fields?
I'm assuming that < and > should be escaped as &lt; and &gt;, what else should I replace?
Along those lines, are there any recommendations for the maximum length allowed for such fields?
You need to escape & to &amp; first, then < to &lt;. Contrary to popular belief, it is not necessary to escape > to &gt;. There is no need to protect the bracket that closes an HTML tag if there is no way to open one.
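As a rough sketch in Java (since the question mentions Servlets), the two replacements could look like the following. The class and method names are made up for illustration, and the sketch assumes the value ends up in element content; a value written into an HTML attribute would also need its quote characters handled, which this does not do.

```java
// Hypothetical helper, not part of the Servlet API.
final class HtmlEscaper {
    private HtmlEscaper() {}

    /** Escapes text destined for HTML element content (not attribute values). */
    static String escape(String input) {
        if (input == null) {
            return "";
        }
        // '&' has to go first; otherwise the '&' produced by "&lt;" would be escaped again.
        return input.replace("&", "&amp;").replace("<", "&lt;");
    }
}
```

With this, `HtmlEscaper.escape("<b>Tom & Jerry</b>")` yields `&lt;b>Tom &amp; Jerry&lt;/b>`, leaving `>` untouched as described above.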
It's your call whether to escape the data before writing it to the database or each time it is read back out. Escaping on the input side is going to be faster; escaping on the output side is going to be more secure, and it also makes interchanging data with other apps easier, since you don't always have to unescape stuff before sending it off to another app. I personally would pay the performance price and escape on the output side. Caching can help.
The rest of the validation you'll want to do depends on the type of data. For an e-mail address, check to make sure it has an @ and at least one . after that, then, if you care whether it's valid or not, send the address a test e-mail. It is next to impossible to completely validate an e-mail address much further than that, and even if the address is syntactically valid, that still doesn't mean it can be delivered. Similarly, allow almost anything as a URL and then try to retrieve it to see if it's valid. For a billing/shipping address, use the USPS Web service to validate and get the data in the best format (for U.S. addresses).
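The loose e-mail check described above might be sketched like this in Java. The names are hypothetical, and requiring at least one character before the @ is a small assumption of mine on top of the rule stated above. It only says whether the address looks plausible; deliverability still needs the test e-mail.

```java
// Hypothetical helper for a deliberately loose plausibility check.
final class EmailCheck {
    private EmailCheck() {}

    /** True if there is an '@' with something before it and a '.' somewhere after it. */
    static boolean looksLikeEmail(String address) {
        if (address == null) {
            return false;
        }
        int at = address.indexOf('@');
        if (at < 1) {
            return false; // no '@' at all, or nothing in front of it (my added assumption)
        }
        // Look for a '.' only in the part that follows the first '@'.
        return address.indexOf('.', at + 1) != -1;
    }
}
```

For example, `EmailCheck.looksLikeEmail("user@example.com")` passes, while `"user@localhost"` and `"first.last@host"` do not, because neither has a dot after the @.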