I'm using AntiSamy in a new project to prevent XSS vulnerabilities. In the application, a user can upload content via a simple (ANSI-encoded) Excel file. It should be possible to insert some HTML, but not JavaScript and the like.
When I scan my input with AntiSamy, I get this error: The a tag contained an attribute that we could not process. The href attribute had a value of "https& # 5 8 ;& # 4 7 ;& # 4 7 ;bla& # 4 6 ;bla& # 4 6 ;com& # 4 7 ;bla& # 4 7 ;...". This value could not be accepted for security reasons. We have chosen to filter the a tag in order to continue processing the input.
I added spaces inside the entities so that they are visible here.
(The value should be https://bla.bla.com/bla/...)
When I debug through the code, the 'tainted HTML input' and its href attribute look correct (so there is no problem with the encoding of the Excel file).
The antisamy-policy file looks like this:
...
<regexp name="onsiteURL" value="([\w\\/\.\?=&;#-~]+|#(\w)+)"/>
<regexp name="offsiteURL" value="(\s)*((ht|f)tp(s?)://|mailto:)[A-Za-z0-9]+[~a-zA-Z0-9-_\.@#$%&;:,\?=/\+!]*(\s)*"/>
...
<attribute name="href">
    <regexp-list>
        <regexp name="onsiteURL"/>
        <regexp name="offsiteURL"/>
    </regexp-list>
    <literal-list>
        <literal value="javascript:void(0)"/>
    </literal-list>
</attribute>
...
I also tested the regex pattern, and as I expected the plain link is valid. Of course it is no longer valid once it is encoded with HTML entities.
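The regex test described above can be reproduced with a short sketch. It checks only the offsiteURL pattern from the policy excerpt and assumes that a value is accepted when the pattern matches the whole string. The class name `OffsiteUrlCheck` and the sample URL (the truncated link cut off after `/bla/`) are placeholders for illustration.

```java
import java.util.regex.Pattern;

public class OffsiteUrlCheck {
    // offsiteURL pattern copied from the policy excerpt, escaped for a Java string literal.
    static final Pattern OFFSITE_URL = Pattern.compile(
        "(\\s)*((ht|f)tp(s?)://|mailto:)[A-Za-z0-9]+[~a-zA-Z0-9-_\\.@#$%&;:,\\?=/\\+!]*(\\s)*");

    // Assumption: the value must match the pattern as a whole, not just contain a match.
    static boolean matchesOffsite(String href) {
        return OFFSITE_URL.matcher(href).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample values, not the real link from the upload.
        String plain = "https://bla.bla.com/bla/";
        String entityEncoded = "https&#58;&#47;&#47;bla&#46;bla&#46;com&#47;bla&#47;";

        System.out.println("plain:          " + matchesOffsite(plain));
        System.out.println("entity-encoded: " + matchesOffsite(entityEncoded));
    }
}
```

The plain form matches, while the entity-encoded form fails because the literal `://` required after the scheme is no longer present.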
So what's the problem?
Thanks a lot in advance
I debugged through the AntiSamy code a bit and now I see the problem, but I still can't fix it. The HTML entities are added by AntiSamy AFTER validation (as if I were going to print the value on an HTML page). My input, however, is parsed by org.cyberneko.html.parsers.DOMFragmentParser inside the AntiSamy library with this statement: parser.parse(new InputSource(new StringReader(html)), dom); After that, the href attribute of my a tag contains something like https://bla.bla.com/bla?frame=Frameset[undefinable character]lang=en instead of https://bla.bla.com/bla?frame=Frameset&lang=en
So it seems to be an encoding problem: the ampersand is no longer an ampersand. How can I find out which encoding I should use?
Edit: The character is E2 8C A9 -> ⟨
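To identify the character, the three bytes can be decoded and the resulting code point printed. This is a minimal sketch that assumes the bytes are UTF-8; the class name `DecodeBytes` is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class DecodeBytes {
    // Decodes the bytes as UTF-8 (an assumption) and returns the first code point.
    static int codePointOf(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE2, (byte) 0x8C, (byte) 0xA9};
        System.out.printf("U+%04X%n", codePointOf(bytes)); // prints U+2329
    }
}
```

U+2329 is LEFT-POINTING ANGLE BRACKET, which HTML 4 defines as the named entity `lang`. That suggests, though the post does not confirm it, that the parser read `&lang` in the query string as an entity reference rather than as a literal ampersand followed by `lang`.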
I've made a small workaround by replacing the "&" with "&amp;". I don't know why, but it works. This is the only character that doesn't work properly.
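A slightly more careful variant of this workaround escapes only ampersands that do not already begin a complete entity reference, so existing entities are not double-escaped. This is a sketch of that idea and not part of AntiSamy; the class `AmpersandEscaper` and the method `escapeBareAmpersands` are hypothetical names.

```java
import java.util.regex.Pattern;

public class AmpersandEscaper {
    // Matches '&' unless it starts a complete reference terminated by ';',
    // such as &amp; or &#58; or &#x3A;.
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Hypothetical helper: call it on the raw input before handing it to the scanner.
    static String escapeBareAmpersands(String html) {
        return BARE_AMP.matcher(html).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String in = "<a href=\"https://bla.bla.com/bla?frame=Frameset&lang=en\">link</a>";
        System.out.println(escapeBareAmpersands(in));
    }
}
```

With this input, `&lang=en` becomes `&amp;lang=en`, so an HTML parser has no `&lang` sequence left to interpret as an entity, while an existing `&amp;` is left unchanged.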